7  Nonparametric Statistical Tests

7.1 When to use non-parametric methods

  • Short answer: Good default when \(P\)-values are needed and there are no covariates to adjust for
  • Nonparametric methods are those not requiring one to assume a certain distribution for the raw data
    • In contrast, parametric methods assume data come from some underlying distribution
    • \(t\)-tests assume the data come from a Gaussian distribution
  • Response variable ordinal or interval
  • For ordinal responses nonparametric methods are preferred because they assume no spacing between categories
  • No problem in using nonparametric tests on interval data
    • if normality holds, nonpar. test 0.95 efficient, i.e., has about the same power as the parametric test done on 0.95 of the observations1
    • if normality does not hold, nonpar. tests can be arbitrarily more efficient and powerful than the corresponding parametric test
    • an elegant and non-arbitrary way to deal with extreme values or outliers
    • rank-based nonparametric tests give the analyst freedom from having to choose the correct transformation of the measurement (as long as the optimum transformation is monotonic)
  • Nonparametric methods are robust, many parametric methods are not
    • Example: \(t\)-test comparing two sets of measurements

1 The large-sample efficiency of the Wilcoxon and Spearman tests compared to \(t\) and \(r\) tests is \(\frac{3}{\pi} = 0.9549\).

Group 1                Group 2                                         Means          \(P\)
1 2 3 4 5 6 7 8 9 10   7 8 9 10 11 12 13 14 15 16 17 18 19 20          5.5 and 13.5   0.000019
1 2 3 4 5 6 7 8 9 10   7 8 9 10 11 12 13 14 15 16 17 18 19 20 200      5.5 and 25.9   0.12

The SD is a particularly non-robust statistical estimator.

  • Example: Fecal calprotectin being evaluated as a possible biomarker of disease severity (Figure 7.1)
    • Calprotectin has an upper detection limit
    • Median can be calculated (mean cannot)
  • If all you want is a \(P\)-value nonpar. tests are preferred
    • Especially if response is univariate and no need to adjust for covariates
  • Pre-testing for normality and deciding nonparametric vs. parametric analysis is a bad idea
    • Tests for normality do not have a power of 1.0 and type I error of 0.0
    • Leads to temptation, e.g., an investigator might “forget” to do the test of normality if the \(t\)-test is significant
    • Doesn’t acknowledge that nonparametric tests are very efficient even under normality
    • Pre-testing for normality alters the type I error and confidence interval coverage
  • A drawback is that nonpar. tests do not correspond to usual confidence limits for effects
    • E.g., a CL for the difference in 2 means may include zero whereas the Wilcoxon test yields \(P=0.01\)

    • Point estimate that exactly corresponds to the Wilcoxon two-sample test is the Hodges-Lehmann estimate of the location difference

      • median of all possible differences between a measurement from group 1 and a measurement from group 2
  • Nonparametric tests are often obtained by replacing the data with ranks across subjects and then doing the parametric test
  • Many nonpar. tests give the same \(P\)-value regardless of how the data are transformed; a careful choice of transformation (e.g., log) must sometimes be used in the context of parametric tests
  • \(P\)-values computed using e.g. the \(t\) distribution are quite accurate for nonparametric tests
  • In case of ties, midranks are used, e.g., if the raw data were 105 120 120 121 the ranks would be 1 2.5 2.5 4
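A one-line check of the midrank convention using R's rank function:
Code
rank(c(105, 120, 120, 121))   # returns 1.0 2.5 2.5 4.0; tied values receive midranks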
Parametric Test      Nonparametric Counterpart       Semiparametric Model Counterpart
1-sample \(t\)       Wilcoxon signed-rank
2-sample \(t\)       Wilcoxon 2-sample rank-sum      Proportional odds
\(k\)-sample ANOVA   Kruskal-Wallis                  Proportional odds
Pearson \(r\)        Spearman \(\rho\)

7.2 One Sample Test: Wilcoxon Signed-Rank

  • Almost always used on paired data where the column of values represents differences (e.g., post-pre) or log ratios
  • The sign test is the simplest test for the median difference being zero in the population
    • it just counts the number of positive differences after tossing out zero differences
    • tests \(H_{0}:~\)Prob\([x>0]=\frac{1}{2}\), i.e., that it is equally likely in the population to have a value below zero as it is to have a value above zero
    • as it ignores magnitudes completely, the test is inefficient
  • By contrast, with the much more powerful Wilcoxon signed rank one-sample test, ranks of absolute differences are given the sign of the original difference
  • Magnitudes of raw data matter more here than with the Wilcoxon 2-sample test
  • Example: A crossover study in which the treatment order is randomized
    Data arranged so that treatment A is in the first column, no matter which order treatment A was given
A   B   B-A   Rank \(|\mathrm{B-A}|\)   Signed Rank
5   6    1            1.5                    1.5
6   5   -1            1.5                   -1.5
4   9    5            4.0                    4.0
7   9    2            3.0                    3.0
  • A good approximation to an exact \(P\)-value may be obtained by computing \[z = \frac{\sum{SR_{i}}}{\sqrt{\sum{SR_{i}^{2}}}},\] where the signed rank for observation \(i\) is \(SR_{i}\). This formula already takes ties into account without using Rosner’s messy Eq. 9.5. We look up \(|z|\) against the normal distribution. Here \(z=\frac{7}{\sqrt{29.5}}=1.29\) and the 2-tailed \(P\)-value is given below.
Code
sr <- c(1.5, -1.5, 4, 3)
z <- sum(sr) / sqrt(sum(sr ^ 2))
pval <- 2 * (1 - pnorm(abs(z)))
c(z=z, pval=pval)
        z      pval 
1.2888045 0.1974661 
  • If all differences are positive or all are negative, the exact 2-tailed \(P\)-value is \(\frac{1}{2^{n-1}}\)
    • implies that \(n\) must exceed 5 for any possibility of significance at the \(\alpha=0.05\) level for a 2-tailed test
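A quick check of this formula, tabulating the exact 2-tailed \(P\)-value when all nonzero differences have the same sign:
Code
# Exact 2-tailed P-value 1 / 2^(n-1) for n = 2, ..., 8 nonzero differences
n <- 2 : 8
data.frame(n, exact.P = 0.5 ^ (n - 1))   # P < 0.05 first occurs at n = 6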

7.2.1 One sample/Paired Test Example

  • Sleep Dataset
    • Compare the effects of two soporific drugs.
    • Each subject receives placebo, Drug 1, and Drug 2
    • Study question: Is Drug 1 or Drug 2 more effective at increasing sleep?
    • Dependent variable: Difference in hours of sleep comparing Drug 2 to Drug 1
    • \(H_0:\) For any given subject, the difference in hours of sleep is equally likely to be positive or negative
    • See Section 5.13.1 for a parametric test on these data
    • Ignore the RD column for now
Side comment: why did the investigators believe that the original 4-level ordinal outcome was so defective that it needed to be transformed in an information-losing way?
Hours of extra sleep on drugs 1 and 2, differences, signs and signed ranks of sleep study data. Rank differences (RD) are also shown (Section 7.2.2).
Subject   Drug 1   Drug 2   Diff (2-1)   Sign   Rank    RD
1            0.7      1.9       1.2       +      3       5
2           -1.6      0.8       2.4       +      8       8.5
3           -0.2      1.1       1.3       +      4.5     8
4           -1.2      0.1       1.3       +      4.5     5
5           -0.1     -0.1       0.0              NA      0
6            3.4      4.4       1.0       +      2       2.5
7            3.7      5.5       1.8       +      7       3
8            0.8      1.6       0.8       +      1       2.5
9            0.0      4.6       4.6       +      9      13.0
10           2.0      3.4       1.4       +      6       1.5
Code
drug1 <- c(.7, -1.6, -.2, -1.2, -.1, 3.4, 3.7, .8, 0, 2)
drug2 <- c(1.9, .8, 1.1, .1, -.1, 4.4, 5.5, 1.6, 4.6, 3.4)
wilcox.test(drug2, drug1, paired=TRUE)

    Wilcoxon signed rank test with continuity correction

data:  drug2 and drug1
V = 45, p-value = 0.009091
alternative hypothesis: true location shift is not equal to 0
Code
wilcox.test(drug2 - drug1)

    Wilcoxon signed rank test with continuity correction

data:  drug2 - drug1
V = 45, p-value = 0.009091
alternative hypothesis: true location is not equal to 0
Code
wilcox.test(drug2 - drug1, correct=FALSE)

    Wilcoxon signed rank test

data:  drug2 - drug1
V = 45, p-value = 0.007632
alternative hypothesis: true location is not equal to 0
Code
sr <- c(3, 8, 4.5, 4.5, 0, 2, 7, 1, 9, 6)
z <- sum(sr) / sqrt(sum(sr ^ 2))
c(z=z, pval=2 * (1 - pnorm(abs(z))))
          z        pval 
2.667911250 0.007632442 
Code
d <- data.frame(Drug=c(rep('Drug 1', 10), rep('Drug 2', 10),
                  rep('Difference', 10)),
                extra=c(drug1, drug2, drug2 - drug1))
  • Interpretation: Reject \(H_0\); evidence that Drug 2 and Drug 1 do not increase sleep by the same number of hours (\(p = 0.008\))
  • Could also perform sign test on sleep data
    • If drugs are equally effective, should have the same number of ‘+’ and ‘-’
    • Observed data: 0 ‘-’ and 9 ‘+’, after throwing out the 1 “no change”
    • Sign test (2-sided) \(P\)-value: Probability of observing 9 of 9 + or 9 of 9 -
    • \(p = 0.004\), so evidence against \(H_0\)
Code
2 * (1 / 2) ^ 9    # 2 * to make it two-tailed
[1] 0.00390625
  • The signed rank test assumes that the distribution of differences is symmetric
  • It tests whether the median difference is zero
  • Also tests that the mean is zero
  • In general it tests that, for two randomly chosen observations \(i\) and \(j\) with values (differences) \(x_i\) and \(x_j\), the probability that \(x_{i}+x_{j} > 0\) is \(\frac{1}{2}\)
  • The estimator that corresponds exactly to the test in all situations is the pseudomedian, the median of all possible pairwise averages of \(x_i\) and \(x_j\), so one could say that the signed rank test tests \(H_{0}\): pseudomedian=0
  • The value \(\frac{\overline{SR}}{n+1}+\frac{1}{2}\) estimates the probability that two randomly chosen observations have a positive sum, where \(\overline{SR}\) is the mean of the column of signed ranks
  • To test \(H_{0}:\eta=\eta_{0}\), where \(\eta\) is the population median (not a difference) and \(\eta_{0}\) is some constant, we create the \(n\) values \(x_{i} - \eta_{0}\) and feed those to the signed rank test, assuming the distribution is symmetric
  • When all nonzero values are of the same sign, the test reduces to the sign test and the 2-tailed \(P\)-value is \((\frac{1}{2})^{n-1}\) where \(n\) is the number of nonzero values
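A minimal sketch of the shifted signed-rank test for \(H_{0}:\eta=\eta_{0}\) described two bullets above, using simulated data (the hypothesized median of 3 is just for illustration):
Code
# Test H0: population median = 3 by shifting the data and applying the
# one-sample Wilcoxon signed-rank test (assumes x - 3 is roughly symmetric)
set.seed(3)
x <- rnorm(20, mean=3.5, sd=1)
wilcox.test(x - 3)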

Test whether the continuity correction makes \(P\)-values closer to the exact calculation2, and compare to our simple formula.

2 The exact \(P\)-value is available only when there are no ties.

Code
# Assume we are already starting with signed ranks as x
wsr <- function(x, ...) wilcox.test(x, ...)$p.value
sim <- function(x) {
  z <- sum(x) / sqrt(sum(x ^ 2))
  2 * (1 - pnorm(abs(z))) }
alls <- function(x) round(c(
  continuity=wsr(x, correct=TRUE, exact=FALSE),
  nocontinuity=wsr(x, correct=FALSE, exact=FALSE),
  exact=wsr(x, exact=TRUE),
  simple=sim(x)), 4)
alls(1:4)
  continuity nocontinuity        exact       simple 
      0.1003       0.0679       0.1250       0.0679 
Code
alls(c(-1, 2 : 4))
  continuity nocontinuity        exact       simple 
      0.2012       0.1441       0.2500       0.1441 
Code
alls(c(-2, c(1, 3, 4)))
  continuity nocontinuity        exact       simple 
      0.3613       0.2733       0.3750       0.2733 
Code
alls(c(-1, -2, 3 : 5))
  continuity nocontinuity        exact       simple 
      0.2807       0.2249       0.3125       0.2249 
Code
alls(c(-5, -1, 2, 3, 4, 6))
  continuity nocontinuity        exact       simple 
      0.4017       0.3454       0.4375       0.3454 

From these examples the guidance is to:

  • Use the exact calculation if there are no ties
  • Otherwise use the continuity correction (i.e., the default in wilcox.test) unlike the recommendation for the Pearson \(\chi^2\) test

7.2.2 Rank Difference Test

Unlike the Wilcoxon two-sample comparison, the Wilcoxon signed-rank test has a significant disadvantage: its result depends on how the response variable is transformed. The rank difference test of Kornbrot (1990)3 solves this problem. Kornbrot’s procedure is as follows

3 Also available here

  • Combine the \(n\) “pre” and \(n\) “post” observations into \(2n\) observations
  • Rank these values from 1 to \(2n\) assigning midranks to ties in the usual Wilcoxon fashion
  • Perform the Wilcoxon signed-rank test on the paired ranks

The test is implemented in the R rankdifferencetest package by Brett Klamer4 but the essence of the calculation is below where we run the rank difference test on the data analyzed previously. The rank differences RD are shown in the data table there.

4 The rankdifferencetest package will provide exact p-values in case of ties.

Code
m <- length(drug1)
ranks <- rank(c(drug1, drug2))
rank.diffs <- ranks[-(1 : m)] - ranks[1 : m]
wilcox.test(rank.diffs)

    Wilcoxon signed rank test with continuity correction

data:  rank.diffs
V = 45, p-value = 0.00903
alternative hypothesis: true location is not equal to 0

The results are identical to those of the signed-rank test because every subject had a higher drug 2 response than drug 1 response, except for the one tie, which is excluded from both analyses.

7.3 Two Sample Test: Wilcoxon–Mann–Whitney

  • The Wilcoxon–Mann–Whitney (WMW) 2-sample rank sum test is for testing for equality of central tendency of two distributions (for unpaired data)
  • Ranking is done by combining the two samples and ignoring which sample each observation came from
  • Example:
Females 120 118 121 119
Males 124 120 133
Ranks for Females 3.5 1 5 2
Ranks for Males 6 3.5 7
  • Doing a 2-sample \(t\)-test using these ranks as if they were raw data and computing the \(P\)-value against 4+3-2=5 d.f. will work quite well

  • Some statistical packages compute \(P\)-values exactly (especially if there are no ties)

  • Loosely speaking the WMW test tests whether the population medians of the two groups are the same

  • More accurately and more generally, it tests whether observations in one population tend to be larger than observations in the other

  • Letting \(x_1\) and \(x_2\) respectively be randomly chosen observations from populations one and two, WMW tests \(H_{0}:c=\frac{1}{2}\), where \(c=\)Prob\([x_{1} > x_{2}]\)

  • The \(c\) index (concordance probability) may be estimated by computing \[c = \frac{\bar{R}-\frac{n_{1}+1}{2}}{n_{2}},\] where \(\bar{R}\) is the mean of the ranks in group 1;
    For the above data \(\bar{R} = 2.875\) and \(c=\frac{2.875-2.5}{3}=0.125\), so we estimate that the probability is 0.125 that a randomly chosen female has a value greater than a randomly chosen male.

  • In diagnostic studies where \(x\) is the continuous result of a medical test and the grouping variable is diseased vs. non-diseased, \(c\) is the area under the receiver operating characteristic (ROC) curve

  • Test still has the “probability of ordering” interpretation when the variances of the two samples are markedly different, but it no longer tests anything like the difference in population medians

  • If there is no overlap between measurements from group 1 and those from group 2, the exact 2-sided \(P\)-value for the Wilcoxon test is \(2/ \frac{n!}{n_{1}! n_{2}!}\). If \(n_{1}=n_{2}\), \(n_{1}\) must be \(\geq 4\) to obtain \(P < 0.05\) (in this case \(P = 0.029\)).
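A quick check of the no-overlap formula for equal group sizes:
Code
# Exact 2-sided Wilcoxon P-value when the two groups do not overlap,
# for n1 = n2: 2 / choose(2 * n1, n1)
n1 <- 2 : 6
data.frame(n1, P = 2 / choose(2 * n1, n1))   # P = 0.029 at n1 = 4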

7.3.1 Two Sample WMW example

  • Fecal calprotectin being evaluated as a possible biomarker of disease severity
  • Calprotectin measured in 26 subjects, 8 observed to have no/mild activity by endoscopy
  • Calprotectin has upper detection limit at 2500 units
    • A type of missing data, but need to keep in analysis
  • Study question: Are calprotectin levels different in subjects with no or mild activity compared to subjects with moderate or severe activity?
  • Statement of the null hypothesis
    • \(H_0:\) Populations with no/mild activity have the same distribution of calprotectin as populations with moderate/severe activity
    • \(H_0: c = \frac{1}{2}\)
Code
#Fecal Calprotectin: 2500 is above detection limit
calpro <- c(2500, 244, 2500, 726, 86, 2500, 61, 392, 2500, 114, 1226,
            2500, 168, 910, 627, 2500, 781, 57, 483, 30, 925, 1027,
            2500, 2500, 38, 18)

# Endoscopy score: 1 = No/Mild, 2=Mod/Severe Disease
# Would have been far better to code disease activity as 4 ordinal levels
endo <- c(2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2,
          2, 2, 2, 2, 2, 1, 1)
endo <- factor(endo, 1 : 2,
               c("No or Mild Activity", "Moderate or Severe Activity"))
require(ggplot2)
ggplot(data.frame(endo, calpro), aes(y=calpro, x=endo)) +
  geom_boxplot(color='lightblue', alpha=.85, width=.4) +
  geom_dotplot(binaxis='y', stackdir='center', position='dodge') +
    xlab('') + ylab('Fecal Calprotectin') + coord_flip() +
      geom_hline(aes(yintercept=2500, col=I('red')), linetype='dotted')
wilcox.test(calpro ~ endo)

    Wilcoxon rank sum test with continuity correction

data:  calpro by endo
W = 23.5, p-value = 0.006814
alternative hypothesis: true location shift is not equal to 0
Figure 7.1: Fecal calprotectin by endoscopy severity rating. Red dotted line is the detection limit. Ordinal disease categories should not have been combined.

The following plots the ranks that are used in the Wilcoxon-Mann-Whitney two-sample rank sum test.

Code
ggplot(data.frame(endo, calpro), aes(y=rank(calpro), x=endo)) +
  geom_dotplot(binaxis='y', stackdir='center', position='dodge') +
    xlab('') + ylab('Rank of Fecal Calprotectin') + coord_flip()
Figure 7.2: Ranks of calprotectin
  • Test statistic \(W\) equals the sum of the ranks in the no/mild group minus \(n_1 \times (n_1 + 1) / 2\), where \(n_1\) is the number of subjects in the no/mild sample
  • \(W = 59.5 - \frac{8\times 9}{2} = 23.5\)
  • A common (but loose) interpretation: People with moderate/severe activity have higher median fecal calprotectin levels than people with no/mild activity (\(p = 0.007\)).
  • Better: omit the reference to medians and supplement the \(P\)-value with the \(c\)-index (concordance probability) or Somers’ \(D_{xy}\) rank correlation between calprotectin and endoscopy status. The code for the R somers2 function shows how the concordance probability is computed from the mean of the ranks in one of the two groups.
Code
require(Hmisc)
# Convert endo to a binary variable
somers2(calpro, endo=='Moderate or Severe Activity')
         C        Dxy          n    Missing 
 0.8368056  0.6736111 26.0000000  0.0000000 

If you type somers2 to list the code for the function, you will see from the following lines that the \(c\)-index is tightly related to the Wilcoxon test:

Code
mean.rank <- mean(rank(x)[y == 1])
c.index <- (mean.rank - (n1 + 1)/2) / (n - n1)

7.3.2 Point and Interval Estimates for Wilcoxon Two-Sample Comparison

As mentioned earlier, the effect estimate that is exactly consistent with the Wilcoxon two-sample test is the robust Hodges-Lehmann estimator—the median of all possible differences between a measurement from group 1 and a measurement from group 2. There is a confidence interval for this estimator.

  • Assume data come from distributions with same shape and differ only in location
  • Consider a sample of 4 males and 3 females
  • Difference in sample medians is 124 - 119.5 = 4.5
  • Consider all possible differences between sample 1 and sample 2
Female \(\rightarrow\)       120    118    121    119
Male \(\downarrow\)
124                            4      6      3      5
120                            0      2     -1      1
133                           13     15     12     14
  • Hodges-Lehmann estimate of the sex effect: median of the 12 differences = 4.5
  • In this case it equaled the difference in sample medians just by coincidence
Code
female <- c(120, 118, 121, 119)
male   <- c(124, 120, 133)
differences <- outer(male, female, '-')
differences
     [,1] [,2] [,3] [,4]
[1,]    4    6    3    5
[2,]    0    2   -1    1
[3,]   13   15   12   14
Code
median(differences)
[1] 4.5
Code
# Can't figure out how difference in location is computed below
# It's not the Hodges-Lehmann estimate
wilcox.test(male, female, conf.int=TRUE)

    Wilcoxon rank sum test with continuity correction

data:  male and female
W = 10.5, p-value = 0.1536
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -1 15
sample estimates:
difference in location 
              4.791134 

In general, \(1 - \alpha\) confidence intervals are the set of values that if hypothesized to be the true location parameter would not be rejected at the \(\alpha\) level. wilcox.test computes the location shift by solving for the hypothesized value that yields \(P=1.0\) instead of the more proper median of all differences. Look into this further by plotting the \(P\)-value as a function of the hypothesized value.

Code
dif  <- seq(-3, 15, by=.1)
n    <- length(dif)
pval <- numeric(n)
for(i in 1 : n) pval[i] <- wilcox.test(male - dif[i], female)$p.value
ggplot(data.frame(dif, pval), aes(x=dif, y=pval)) +
  geom_step() +
  geom_hline(yintercept=.05, col='red', linetype='dotted') +
  geom_vline(xintercept=c(4.5, 4.791, -1, 15), col='red', linetype='dotted') +
  xlab('Difference') + ylab('P-value')
Figure 7.3: Wilcoxon \(P\)-value vs. hypothesized male-female difference. Horizontal line is \(P=0.05\). Vertical lines from left to right are the lower 0.95 confidence limit from wilcox.test, the median difference, the Hodges-Lehmann estimator as computed by wilcox.test, and the upper 0.95 confidence limit from wilcox.test.

See Section 7.4 for a more approximate confidence interval.

7.4 Confidence Intervals for Medians and Their Differences

  • Confidence intervals for the median (one sample)
    • Table 18.4 (Altman) gives the ranks of the observations to be used to give approximate confidence intervals for the median
    • e.g., if \(n = 12\), the \(3^\mathrm{rd}\) and \(10^\mathrm{th}\) largest values give a \(0.961\) confidence interval
    • For larger sample sizes, the lower ranked value (\(r\)) and upper ranked value (\(s\)) to select for an approximate \(0.95\) confidence interval for the population median is \[r = \frac{n}{2} - 1.96*\frac{\sqrt{n}}{2} ~~~\mathrm{and}~~~ s = 1 + \frac{n}{2} + 1.96*\frac{\sqrt{n}}{2}\]
    • e.g., if \(n = 100\) then \(r = 40.2\) and \(s = 60.8\), so we would pick the \(40^\mathrm{th}\) and \(61^\mathrm{st}\) largest values from the sample to specify a \(0.95\) confidence interval for the population median
    • For exact confidence interval for the median see
      stats.stackexchange.com/questions/186957, which also discusses why there is no exact nonparametric confidence interval for the mean. Let’s get the exact order statistics that result in an exact confidence interval for the median:
Code
# Exact CI for median from DescTools package SignTest.default
# See also http://www.stat.umn.edu/geyer/old03/5102/notes/rank.pdf,
# http://de.scribd.com/doc/75941305/Confidence-Interval-for-Median-Based-on-Sign-Test
cimed <- function(x, alpha=0.05, na.rm=FALSE) {
  if(na.rm) x <- x[! is.na(x)]
  n <- length(x)
  k <- qbinom(p=alpha / 2, size=n, prob=0.5, lower.tail=TRUE)
  ## Actual CL: 1 - 2 * pbinom(k - 1, size=n, prob=0.5) >= 1 - alpha
  sort(x)[c(k, n - k + 1)]
}
cimed(1 : 100)
[1] 40 61

For \(n=100\) we see that the approximate interval happened to be exact.

  • Confidence intervals for the difference in two medians (two samples)
    • We don’t have a nonparametric interval for this
    • Instead get Hodges-Lehmann estimate
    • Assume data come from distributions with same shape and differ only in location
    • Considers all possible differences between sample 1 and sample 2 using male-female data in Section 7.3.2
    • An estimate of the median difference (males - females) is the median of these 12 differences, with the \(3^\mathrm{rd}\) and \(10^\mathrm{th}\) largest values giving an (approximate) 0.95 CI
    • Median estimate = 4.5, 0.95 CI = [1, 13]
    • Specific formulas found in Altman, pages 40-41
  • Bootstrap
    • General method, not just for medians
    • Non-parametric, does not assume symmetry (but may not be accurate)
    • Iterative method that repeatedly samples from the original data
    • Algorithm for creating a \(0.95\) CI for the difference in two medians
  1. Sample with replacement from sample 1 and sample 2
  2. Calculate the difference in medians, save result
  3. Repeat Steps 1 and 2 1000 times
    • A (naive) \(0.95\) CI is given by the \(25^\mathrm{th}\) and \(975^\mathrm{th}\) largest values of your \(1000\) median differences
    • For the male/female data, median estimate = 4.5, 0.95 CI = [-0.5, 14.5], which agrees with the conclusion from a WMW rank sum test (\(p = 0.15\)). Note that the more accurate CI for the Hodges-Lehmann estimate of \([-1, 15]\) was given earlier (output of wilcox.test).
Code
diffs <- numeric(1000)
set.seed(13)
for(i in 1 : 1000) diffs[i] <-
  median(sample(male, replace=TRUE)) - median(sample(female, replace=TRUE))
ggplot(data.frame(diffs), aes(x=diffs)) + xlab('Differences in Medians') +
  geom_histogram(binwidth=.01, color='blue', fill='white')
quantile(diffs, c(0.025, 0.975))
 2.5% 97.5% 
 -0.5  14.5 
Figure 7.4: Bootstrapped differences in medians

But recall that the Wilcoxon test does not really test the difference in medians but rather the median of all differences.

7.5 Strategy

  • Don’t assess normality of data
  • Use nonparametric test in any case, to get \(P\)-values
  • Use nonparametric confidence intervals for means and medians5, which will conform better to what the nonpar. test is testing
  • To obtain nonparametric confidence limits for means and differences in means, the bootstrap percentile method may easily be used and it does not assume symmetry of the data distribution

5 A good nonparametric confidence interval for a population mean that does not even assume a symmetric distribution can be obtained from the bootstrap simulation procedure.
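A minimal sketch of the bootstrap percentile interval for a mean mentioned above, using simulated right-skewed data for illustration:
Code
# Bootstrap percentile 0.95 CI for a population mean; no symmetry assumed
set.seed(2)
x <- rexp(30)                                              # skewed example sample
boots <- replicate(1000, mean(sample(x, replace=TRUE)))    # bootstrap means
quantile(boots, c(0.025, 0.975))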

7.6 Generalization of the Wilcoxon/Kruskal-Wallis Test

  • Proportional odds (PO) ordinal logistic model
  • Contains Wilcoxon 2-sample and Kruskal-Wallis tests as special cases
    • numerator of the score test for the PO model, when there is only the grouping variable in the model, is exactly the Wilcoxon statistic
  • Special case of PO model is the ordinary binary logistic model
  • Advantages over nonparametric tests:
    • can adjust for covariates
    • more accurate \(P\)-values even with extreme number of tied values
    • provides a framework for consistent pairwise comparisons6
    • provides estimates of means, quantiles, and exceedance probabilities
    • sets the stage for a Bayesian PO model, so can get a Bayesian Wilcoxon test
  • Other ordinal response models are available, e.g., Cox proportional hazards model
  • These models are semiparametric models
    • parametric in additivity and linearity (by default) assumptions
    • nonparametric in not assuming a distribution for the response variable
  • Like nonparametric tests, \(P\)-values are unaffected by monotonic transformations of \(Y\)
  • If the response variable \(Y\) has \(k\) distinct values \(y_1, y_2, \dots, y_k\) in the sample, semiparametric models have \(k-1\) intercepts
  • Binary logistic model deals with prob. of only one event (\(Y=1\) vs. \(Y=0\))
  • For ordinal \(Y\) there are \(k-1\) events
  • Model these as cumulative probabilities to make use of ordering of \(Y\) values
  • Model: \(P(Y \geq y | X) = \frac{1}{1 + \exp[-(\alpha_y + \beta_1 x_1 + \beta_2 x_2 + \dots)]}\)
  • \(\alpha_y\) is the \(j^{\mathrm{th}}\) intercept when \(y = y_{j+1}\), e.g. the first intercept corresponds to the second lowest distinct \(Y\) value \(y_2\)
  • Special case: 2-group problem: \(P(Y \geq y | \mathrm{group}) = \frac{1}{1 + \exp[-(\alpha_y + \beta_1[\mathrm{group B}])]}\)
    • \(\exp(\beta_1)\) is the ratio of odds that \(Y \geq y\) in group B vs. \(Y \geq y\) in group A, for all \(y > y_1\)
    • as before \([x]\) is the 0-1 indicator variable for \(x\) being true
    • \(\beta_1 > 0 \rightarrow Y\) values higher in group B
    • \(k=2 \rightarrow\) model is the binary logistic model (where we take \(\alpha_1 = \beta_0\))
  • These intercepts \(\alpha_1, \alpha_2, \dots, \alpha_{k-1}\) encode the entire empirical distribution of \(Y\) for one of the groups
    • \(\rightarrow\) the model assumes nothing about the \(Y\) distribution
    • it only assumes how the distribution of \(Y\) for one type of subject is connected to the distribution for another type of subject
    • PO model for a two-group problem assumes that the logit of the two cumulative distribution functions are parallel
    • if PO doesn’t hold, PO model may still be better than alternatives
    • PO is also what the Wilcoxon/Kruskal-Wallis test assumes to have optimal power
    • don’t need an \(\alpha\) for the lowest observed \(Y\) value since \(P(Y \geq ~\mathrm{minimum } Y) = 1\)
  • R rms package orm function fits the PO model7 and is especially made for continuous \(Y\), with fast run times for up to 6000 intercepts

6 When using the Kruskal-Wallis test followed by pairwise Wilcoxon tests, these pairwise tests can be inconsistent with each other, because they re-rank the data based only on two groups, destroying the transitivity property, e.g. treatment A can be better than B which is better than C but C is better than A.

7 orm also fits other models using link functions other than the logit.
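As a minimal sketch of this connection, one can fit orm to the small male/female sample from Section 7.3.2 and compare its score test with the Wilcoxon result (this assumes the rms package is installed; it is an illustration, not a full analysis):
Code
require(rms)
female <- c(120, 118, 121, 119)
male   <- c(124, 120, 133)
y      <- c(female, male)
sex    <- factor(c(rep('F', 4), rep('M', 3)))
wilcox.test(y ~ sex)   # WMW two-sample test
orm(y ~ sex)           # PO model; its score chi-square plays the role of the Wilcoxon statistic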

7.6.1 Kruskal-Wallis Test

  • Notice we haven’t described rank ANOVA—the Kruskal-Wallis test
  • Don’t need it; just form a PO model with more than one indicator variable
  • E.g., to test for any differences among four groups A B C D form 3 indicator variables for B C D and let A be the reference cell that corresponds to the \(\alpha\) intercepts
    • model is logit\(P(Y \geq y | \mathrm{group}) = \alpha_y + \beta_1 [B] + \beta_2 [C] + \beta_3 [D]\)
  • Use the likelihood ratio \(\chi^2\) test from this model to test the global null hypothesis A=B=C=D with 3 d.f.
  • Solves the transitivity problem mentioned earlier
  • Can obtain consistent pairwise comparisons by forming odds ratios for any comparison
    • e.g. C:A comparison will use \(\exp(\hat{\beta_{2}})\)
    • C:B comparison OR: \(\exp(\hat{\beta_{2}} - \hat{\beta_{1}})\)
  • As before can convert the ORs to differences in medians/means because unlike the original nonparametric tests, the PO model can be used to obtain many types of predictions8
  • Illustrate this by a non-PO example, checking to see how well the PO model can recover the sample means when assuming (the slightly incorrect) PO
  • Take 4 samples from normal distributions with the same variances but different means
  • Also show how to compare two of the samples without re-ranking the data as inconsistent Wilcoxon tests would do

8 The predicted mean for a set of covariate settings is obtained by using all the intercepts and \(\beta\)s to get exceedance probabilities for \(Y \geq y\), taking successive differences in those probabilities to get cell probabilities that \(Y=y\), then multiplying cell probabilities by the \(y\) value attached to them, and summing. This is the formula for the mean for a discrete distribution.
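The calculation in footnote 8 can be sketched directly. The function below is a minimal illustration using hypothetical inputs, not the rms implementation: alpha holds the \(k-1\) intercepts (for \(y_2, \dots, y_k\)), yval the \(k\) distinct \(Y\) values, and lp the linear predictor \(X\hat{\beta}\) (excluding intercepts) for one covariate setting.
Code
# Mean of Y under the PO model for one covariate setting:
# exceedance probabilities -> cell probabilities -> mean of a discrete distribution
mean_from_po <- function(alpha, yval, lp = 0) {
  exceed <- plogis(alpha + lp)            # P(Y >= y_j | X) for j = 2, ..., k
  cellp  <- c(1, exceed) - c(exceed, 0)   # successive differences: P(Y = y_j), j = 1, ..., k
  sum(cellp * yval)
}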

Code
set.seed(1)
group <- rep(c('A','B','C','D'), 100)
y <- rnorm(400, 100, 15) + 10*(group == 'B') + 20*(group=='C') + 30*(group=='D')
require(rms)
options(prType='html')
dd <- datadist(group, y); options(datadist='dd')
f <- orm(y ~ group)
f    # use LR chi-square test as replacement for Kruskal-Wallis

Logistic (Proportional Odds) Ordinal Regression Model

orm(formula = y ~ group)
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 400 LR χ2 193.31 R2 0.383 ρ 0.633
Distinct Y 400 d.f. 3 R23,400 0.379
Y0.5 115.4143 Pr(>χ2) <0.0001 R23,400 0.379
max |∂log L/∂β| 2×10-6 Score χ2 193.21 |Pr(Y ≥ median)-½| 0.256
Pr(>χ2) <0.0001
β S.E. Wald Z Pr(>|Z|)
group=B  1.4221  0.2579 5.51 <0.0001
group=C  2.6624  0.2762 9.64 <0.0001
group=D  3.6606  0.2925 12.52 <0.0001
Code
# Derive R function to use all intercepts and betas to compute predicted means
M <- Mean(f)
Predict(f, group, fun=M)
  group      yhat     lower    upper
1     A  99.32328  96.46657 102.1800
2     B 111.21326 108.27169 114.1548
3     C 121.63880 118.78670 124.4909
4     D 129.70290 127.07203 132.3338

Response variable (y):  

Limits are 0.95 confidence limits
Code
# Compare with sample means
summarize(y, group, smean.cl.normal)
  group         y     Lower    Upper
1     A  98.72953  95.81508 101.6440
2     B 111.69464 108.61130 114.7780
3     C 121.80841 118.93036 124.6865
4     D 130.05275 127.40318 132.7023
Code
# Compare B and C
k <- contrast(f, list(group='C'), list(group='B'))
k
   Contrast      S.E.     Lower    Upper    Z Pr(>|z|)
11 1.240366 0.2564632 0.7377076 1.743025 4.84        0

Confidence intervals are 0.95 individual intervals
Code
# Show odds ratios instead of differences in betas
print(k, fun=exp)
   Contrast S.E.    Lower    Upper    Z Pr(>|z|)
11 3.456879   NA 2.091136 5.714604 4.84        0

Confidence intervals are 0.95 individual intervals

7.6.2 PO Re-analysis

  • Reconsider the calprotectin data analyzed in Section 7.3.1
  • Wilcoxon: \(P=0.0068, c=0.837\)
  • Frequentist PO model:

Code
require(rms)
options(prType='html')
dd <- datadist(calpro, endo); options(datadist='dd')
f <- orm(calpro ~ endo)
print(f, intercepts=TRUE)

Logistic (Proportional Odds) Ordinal Regression Model

orm(formula = calpro ~ endo)

Frequencies of Responses

  18   30   38   57   61   86  114  168  244  392  483  627  726  781  910  925 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
1027 1226 2500 
   1    1    8 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 26 LR χ2 9.84 R2 0.317 ρ 0.547
Distinct Y 19 d.f. 1 R21,26 0.288
Y0.5 753.5 Pr(>χ2) 0.0017 R21,25.2 0.296
max |∂log L/∂β| 5×10-5 Score χ2 9.86 |Pr(Y ≥ median)-½| 0.251
Pr(>χ2) 0.0017
β S.E. Wald Z Pr(>|Z|)
y≥30   2.0969  1.0756 1.95 0.0512
y≥38   1.3395  0.8160 1.64 0.1007
y≥57   0.8678  0.7135 1.22 0.2239
y≥61   0.4733  0.6689 0.71 0.4792
y≥86   0.1122  0.6575 0.17 0.8645
y≥114  -0.1956  0.6558 -0.30 0.7655
y≥168  -0.4710  0.6608 -0.71 0.4760
y≥244  -0.7653  0.6868 -1.11 0.2652
y≥392  -1.0953  0.7427 -1.47 0.1403
y≥483  -1.4155  0.8015 -1.77 0.0774
y≥627  -1.6849  0.8383 -2.01 0.0445
y≥726  -1.9227  0.8641 -2.23 0.0261
y≥781  -2.1399  0.8836 -2.42 0.0154
y≥910  -2.3439  0.8993 -2.61 0.0092
y≥925  -2.5396  0.9128 -2.78 0.0054
y≥1027  -2.7312  0.9249 -2.95 0.0031
y≥1226  -2.9224  0.9365 -3.12 0.0018
y≥2500  -3.1166  0.9482 -3.29 0.0010
endo=Moderate or Severe Activity   2.7586  0.9576 2.88 0.0040

  • Intercept -3.1166 corresponds to \(Y\) being at or above the upper detection limit
  • Use the likelihood ratio (LR) \(\chi^2\) test from the model
  • To estimate an exceedance probability just select the corresponding intercept and compute as for a binary logistic model
  • The 18 intercepts for 19 distinct \(Y\) values represent the logit of the empirical cumulative distribution function for the no/mild reference group if the two groups are in proportional odds9. Add 2.7586 to those intercepts to get the logit CDF for the moderate/severe group.
  • Compute odds ratio and CI

9 The intercepts really represent the logit of one minus the CDF, moved one \(Y\) value.

Code
summary(f, endo='No or Mild Activity')
Effects   Response: calpro
Low High Δ Effect S.E. Lower 0.95 Upper 0.95
endo --- Moderate or Severe Activity:No or Mild Activity 1 2 2.759 0.9576 0.8817 4.635
Odds Ratio 1 2 15.780 2.4150 103.100

  • The above odds ratio of 15.8 is the odds of having calprotectin \(\geq y\) in the moderate/severe activity group vs. the no/mild activity group
    • By the PO assumption this odds ratio is the same for all \(y\)
  • Simulations provided an empirical conversion of the PO regression coefficient to \(c\):
Code
b <- coef(f)['endo=Moderate or Severe Activity']
cindex <- plogis((b - 0.0029) / 1.5405)
cindex
endo=Moderate or Severe Activity 
                       0.8567812 

Compare this to the exact value of 0.837.

  • From the fitted PO model, obtain the following for each group and compare with the sample estimates:
    • prob. calprotectin at or above the upper limit of normal
    • mean
    • median
  • In the output of Predict() see the point estimates under yhat, starting with the estimates for \(P(Y \geq 2500)\), i.e., marker value at or above the upper detection limit
Code
ex <- ExProb(f)
exceed <- function(lp) ex(lp, y=2500)
ymean  <- Mean(f)
yquant <- Quantile(f)
ymed   <- function(lp) yquant(0.5, lp=lp)
Predict(f, endo, fun=exceed)
                         endo       yhat       lower     upper
1         No or Mild Activity 0.04242948 0.008080776 0.1941982
2 Moderate or Severe Activity 0.41144485 0.209594428 0.6482557

Response variable (y):  

Limits are 0.95 confidence limits
Code
# Compute empirical exceedance probabilities
tapply(calpro >= 2500, endo, mean)
        No or Mild Activity Moderate or Severe Activity 
                  0.1250000                   0.3888889 
Code
# Note that imposing PO assumption made modeled means closer together than
# stratified sample means
Predict(f, endo, fun=ymean)
                         endo     yhat       lower     upper
1         No or Mild Activity  300.259   0.3898075  600.1282
2 Moderate or Severe Activity 1387.660 947.0112521 1828.3090

Response variable (y):  

Limits are 0.95 confidence limits
Code
tapply(calpro, endo, mean)
        No or Mild Activity Moderate or Severe Activity 
                    400.000                    1372.944 
Code
Predict(f, endo, fun=ymed)
                         endo      yhat     lower     upper
1         No or Mild Activity   82.2074  25.72658  574.9781
2 Moderate or Severe Activity 1038.8082 622.74020 2500.0000

Response variable (y):  

Limits are 0.95 confidence limits
Code
tapply(calpro, endo, median)
        No or Mild Activity Moderate or Severe Activity 
                       87.5                       976.0 
  • Note: confidence intervals for these derived quantities are approximate

7.7 Checking Assumptions of the Wilcoxon Test

  • Proportional odds (PO) model and its special case the Wilcoxon test assume PO
  • What does it mean to assume PO?
    • Under \(H_0\) (the two distributions are identical) there is no assumption, i.e., the type I error probability will behave as advertised
    • Under \(H_1\) the test may still work OK but it will not be optimal unless PO holds
  • To check PO:
    • Compute the empirical cumulative distribution function (ECDF) for the response variable, stratified by group (see Section 4.3.3.2)
    • Take the logit transformation of each ECDF
    • Check for parallelism
    • Linearity would be required only if using a parametric logistic distribution instead of using our semiparametric PO model
  • Parametric \(t\)-test requires parallelism and linearity when the ECDFs are normal-inverse transformed
    • linearity: normal distribution (like q-q plot)
    • parallelism: equal variances
  • Problem with assumption checking is that with small samples ECDFs are too noisy to see patterns clearly
  • Example from a larger dataset: Mayo Clinic Primary Biliary Cirrhosis Dataset
  • Compare distribution of serum bilirubin for those patients with spider veins vs. those without
Code
getHdata(pbc)
# Take logit of ECDF
Ecdf(~ bili, group = spiders, data=pbc, fun=qlogis)
Figure 7.5: Checking for parallelism on the logit scale (proportional odds Wilcoxon assumption)
  • The curves are primarily parallel (even at the far left, despite the optical illusion)
  • Nonlinearity is irrelevant
  • Check \(t\)-test assumptions
Code
Ecdf(~ bili, group=spiders, data=pbc, fun=qnorm)
Figure 7.6: Checking for parallelism on the normal inverse scale (equal variance \(t\)-test and normal scores rank test assumptions)
  • Curves are primarily parallel (variances are equal; probit ordinal regression or normal scores rank test would work)
  • But they are not straight lines as required by \(t\)-test normality assumption

7.8 Two-Way ANOVA Ordinal Regression Example

  • Problem and data posted to stats.stackexchange.com/questions/622230
  • Two binary factors: sex and surface (for dorsal/ventral)
  • Response variable: y
  • There is significant uncertainty in how to transform y for a parametric analysis
  • Ordinal models provide the same regression coefficients, standard errors, and confidence intervals regardless of how y is transformed (as long as the transformation is monotonic, e.g., log, square root, etc.)
  • Quantiles are also unaffected by any choice of transformation
  • Estimated means do depend on the transformation used
  • Ordinal models are also robust to “outliers”
Code
d <- csv.get(textConnection('
id  sex surface y
1   female  UN  1255
2   female  UN  542
3   female  UN  818
1   female  UN  274
2   female  UN  261
3   female  UN  314
1   female  UP  552
2   female  UP  548
3   female  UP  721
1   female  UP  431
2   female  UP  354
3   female  UP  738
4   male    UN  901
5   male    UN  619
6   male    UN  861
7   male    UN  713
8   male    UN  717
4   male    UN  275
5   male    UN  300
6   male    UN  244
7   male    UN  281
8   male    UN  231
4   male    UP  532
5   male    UP  451
6   male    UP  482
7   male    UP  374
8   male    UP  424
4   male    UP  193
5   male    UP  118
6   male    UP  207
7   male    UP  208
8   male    UP  252
'), sep='\t')
setDT(d)
1. \t is the tab character
2. turn the data.frame into a data.table
  • Tentatively choose the proportional odds (PO) model
  • Check its assumptions (note this is difficult with small \(N\))
  • Stratify by both factors so there are 4 groups
  • Check the logit link (PO assumption) by taking the logit of the stratified empirical cumulative distribution function (CDF) estimates
    • Can’t take logit of 1.0 so for those values substitute 0.99
  • PO assumption \(\equiv\) parallelism of logit-transformed cumulative distributions
  • PO assumption is analogous to equal variance assumption in ANOVA
  • Unlike ANOVA there is no assumption about the shape of any one group’s CDF
Code
v <- d[, ecdfSteps(y, extend=FALSE), by=.(sex, surface)]
ggplot(v, aes(x, qlogis(pmin(y, 0.99)), color=sex, linetype=surface)) +
  geom_step() + xlab('y') + ylab('logit ECDF')
1. ecdfSteps is in the Hmisc package; it computes coordinates of empirical cumulative distribution functions; data.table uses by= for stratification
Figure 7.7: logit ECDF plots for checking the PO assumption (parallelism) in the 2-factor problem
  • Parallelism is hard to check with \(N=32\) stratified into 4 groups, but no blatant violations are obvious
  • Fit a proportional odds model without interaction, as the sample size does not allow one to adequately estimate the interaction effect
  • Use likelihood ratio \(\chi^2\) statistics, which are more accurate than the usual Wald statistics
Code
dd <- datadist(d); options(datadist='dd')
f <- orm(y ~ sex + surface, data=d, x=TRUE, y=TRUE)
print(f, intercepts=TRUE)
anova(f, test='LR')
1. Save data distribution summaries
2. orm is the rms package’s function for analyzing continuous responses as ordinal. It also works for discrete Y but the rms lrm function is sometimes better for that. x=TRUE, y=TRUE is for getting likelihood ratio \(\chi^2\) tests.
3. The default is not to print the intercepts if there are more than 10 of them

Logistic (Proportional Odds) Ordinal Regression Model

orm(formula = y ~ sex + surface, data = d, x = TRUE, y = TRUE)
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 32 LR χ2 4.67 R2 0.136 ρ 0.395
Distinct Y 32 d.f. 2 R22,32 0.080
Y0.5 427.5 Pr(>χ2) 0.0966 R22,32 0.080
max |∂log L/∂β| 9×10-6 Score χ2 4.54 |Pr(Y ≥ median)-½| 0.137
Pr(>χ2) 0.1036
β S.E. Wald Z Pr(>|Z|)
y≥193   4.8126  1.2445 3.87 0.0001
y≥207   4.0573  1.0130 4.01 <0.0001
y≥208   3.5856  0.9134 3.93 <0.0001
y≥231   3.2269  0.8502 3.80 0.0001
y≥244   2.9369  0.8066 3.64 0.0003
y≥252   2.6957  0.7774 3.47 0.0005
y≥261   2.4797  0.7531 3.29 0.0010
y≥274   2.2863  0.7327 3.12 0.0018
y≥275   2.1194  0.7194 2.95 0.0032
y≥281   1.9677  0.7101 2.77 0.0056
y≥300   1.8238  0.7031 2.59 0.0095
y≥314   1.6858  0.6981 2.41 0.0157
y≥354   1.5559  0.6956 2.24 0.0253
y≥374   1.4342  0.6955 2.06 0.0392
y≥424   1.3127  0.6954 1.89 0.0591
y≥431   1.1861  0.6937 1.71 0.0873
y≥451   1.0580  0.6920 1.53 0.1263
y≥482   0.9273  0.6902 1.34 0.1791
y≥532   0.7877  0.6863 1.15 0.2511
y≥542   0.6365  0.6795 0.94 0.3489
y≥548   0.4821  0.6742 0.71 0.4746
y≥552   0.3286  0.6732 0.49 0.6255
y≥619   0.1697  0.6743 0.25 0.8013
y≥713   0.0003  0.6775 0.00 0.9996
y≥717  -0.1863  0.6830 -0.27 0.7850
y≥721  -0.3969  0.6917 -0.57 0.5661
y≥738  -0.6380  0.7055 -0.90 0.3658
y≥818  -0.9211  0.7283 -1.26 0.2060
y≥861  -1.2581  0.7754 -1.62 0.1047
y≥901  -1.7012  0.8727 -1.95 0.0512
y≥1255  -2.4509  1.1155 -2.20 0.0280
sex=male  -1.2211  0.6677 -1.83 0.0674
surface=UP  -0.7824  0.6446 -1.21 0.2249
Likelihood Ratio Statistics for y
χ2 d.f. P
sex 3.47 1 0.0625
surface 1.50 1 0.2210
TOTAL 4.67 2 0.0966
  • Note that the intercepts encode the (logits of) one minus the empirical distribution function for the reference group (female UN); this is why no distribution is assumed for any one covariate setting
  • One adjusted \(R^2\) is duplicated because the actual sample size (32) equals the effective sample size due to there being no ties in y
  • There is a small amount of evidence (\(p=0.097\)) that either of the two factors is associated with y
  • There is very little evidence for an association with surface
  • There is a small amount of evidence for a sex “effect”
  • The low \(R_\text{adj}^{2}\) of 0.08 implies that the model is of limited use for estimation / prediction
  • Estimate means and medians for all 4 combinations of factors, using the model and empirically
Code
M <- Mean(f)
qu <- Quantile(f)
med <- function(lp) qu(0.5, lp)
g <- function(x) list(Mean=mean(x), Median=median(x))
cat('Sample estimates\n')
d[, g(y), by=.(sex, surface)]
cat('\nModel estimates of means (yhat)\n')
Predict(f, surface, sex=.q(female, male), fun=M)
cat('\nModel estimates of medians (yhat)\n')
Predict(f, surface, sex=.q(female, male), fun=med)
1. Derives an R function that translates from logit scale to mean
2. Likewise for quantiles
3. Particular case of quantiles: median
4. fun=... causes predicted log odds to be transformed by the named function; .q is an R Hmisc package function that quotes strings for you
Sample estimates
      sex surface     Mean Median
1: female      UN 577.3333  428.0
2: female      UP 557.3333  550.0
3:   male      UN 514.2000  459.5
4:   male      UP 324.1000  313.0

Model estimates of means (yhat)
  surface    sex     yhat    lower    upper
1      UN female 640.6768 443.1608 838.1929
2      UP female 523.7141 365.5194 681.9088
3      UN   male 463.0476 324.4904 601.6048
4      UP   male 368.6684 259.6604 477.6764

Response variable (y):  

Limits are 0.95 confidence limits

Model estimates of medians (yhat)
  surface    sex     yhat    lower    upper
1      UN female 669.7486 384.5829 856.3839
2      UP female 508.5184 279.8696 718.9218
3      UN   male 419.3042 264.8156 613.7943
4      UP   male 277.0920 227.9897 463.4452

Response variable (y):  

Limits are 0.95 confidence limits
  • Note that empirical stratified sample estimates of means and medians do not assume PO and do not assume the absence of an interaction, but they have low precision due to dividing the sample size by 4
  • Model estimates assume no interaction and effectively divide the sample size by 3

7.9 Regression Analysis of Paired Data

Just as a binary logistic model may be used to do the McNemar test for paired binary data, regression can be used for testing for effects in paired ordinal or continuous data. Regression is more general:

  • Can adjust for covariates

  • Can have more than 2 (pre, post) periods, e.g., handles a multi-period crossover study

  • Paired \(t\)-test is equivalent to

    • linear regression blocking on subject (subjects are fixed effects)
    • mixed effects linear model with random intercepts for subjects
  • Rank difference test is similar to

    • PO model blocking on subject (but there are problems with having too many parameters)
    • mixed effects PO model

Let’s illustrate this by re-analyzing paired data analyzed previously.

Code
w <- t.test(drug1, drug2, paired=TRUE)
w

    Paired t-test

data:  drug1 and drug2
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean difference 
          -1.58 
Code
w$stderr   # fetch standard error of mean difference
[1] 0.3889587
Code
y    <- c(drug1, drug2)
drug <- rep(c('A', 'B'), each=length(drug1))
id   <- factor(rep(1 : length(drug1), 2))
d    <- data.frame(drug, y, id)
dd   <- datadist(d); options(datadist='dd')
ols(y ~ drug + id)

Linear Regression Model

ols(formula = y ~ drug + id)
Model Likelihood
Ratio Test
Discrimination
Indexes
Obs 20 LR χ2 48.61 R2 0.912
σ 0.8697 d.f. 10 R2adj 0.814
d.f. 9 Pr(>χ2) 0.0000 g 2.248

Residuals

       Min         1Q     Median         3Q        Max 
-1.510e+00 -2.150e-01 -6.939e-18  2.150e-01  1.510e+00 
β S.E. t Pr(>|t|)
Intercept   0.5100  0.6450 0.79 0.4495
drug=B   1.5800  0.3890 4.06 0.0028
id=2  -1.7000  0.8697 -1.95 0.0824
id=3  -0.8500  0.8697 -0.98 0.3540
id=4  -1.8500  0.8697 -2.13 0.0623
id=5  -1.4000  0.8697 -1.61 0.1419
id=6   2.6000  0.8697 2.99 0.0152
id=7   3.3000  0.8697 3.79 0.0043
id=8  -0.1000  0.8697 -0.11 0.9110
id=9   1.0000  0.8697 1.15 0.2799
id=10   1.4000  0.8697 1.61 0.1419
  • Ignore overall \(\chi^2\) test
  • \(\hat{\beta}\) equals the difference computed by the paired \(t\)-test
  • S.E. is also identical
  • Now run random intercepts model
Code
require(nlme)
summary(lme(y ~ drug, random = ~ 1 | id, data=d))
Linear mixed-effects model fit by REML
  Data: d 
       AIC      BIC    logLik
  77.95588 81.51737 -34.97794

Random effects:
 Formula: ~1 | id
        (Intercept)  Residual
StdDev:      1.6877 0.8697384

Fixed effects:  y ~ drug 
            Value Std.Error DF  t-value p-value
(Intercept)  0.75 0.6003979  9 1.249172  0.2431
drugB        1.58 0.3889588  9 4.062127  0.0028
 Correlation: 
      (Intr)
drugB -0.324

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.63372282 -0.34157076  0.03346151  0.31510644  1.83858572 

Number of Observations: 20
Number of Groups: 10 
  • Try a regular regression model where intra-person correlation is adjusted for using the cluster sandwich covariance estimator
Code
f <- ols(y ~ drug, x=TRUE, y=TRUE)
robcov(f, id)
1. x=TRUE, y=TRUE needed for robcov, which is in the rms package

Linear Regression Model

ols(formula = y ~ drug, x = TRUE, y = TRUE)
Model Likelihood
Ratio Test
Discrimination
Indexes
Obs 20 LR χ2 3.52 R2 0.161
σ 1.8986 d.f. 1 R2adj 0.115
d.f. 18 Pr(>χ2) 0.0607 g 0.832
Cluster on id
Clusters 10

Residuals

   Min     1Q Median     3Q    Max 
-2.430 -1.305 -0.580  1.455  3.170 
β S.E. t Pr(>|t|)
Intercept  0.7500  0.5367 1.40 0.1793
drug=B  1.5800  0.3690 4.28 0.0004
  • Standard error is a bit too small

  • P-value is too small because the sandwich estimator leads to a \(z\)-statistic and we should probably be using a \(t\)-statistic with the usual d.f.

  • Now compare rank difference test with PO model

Code
f <- orm(y ~ drug + id, x=TRUE, y=TRUE)
anova(f, test='LR')
Likelihood Ratio Statistics for y
χ2 d.f. P
drug 22.33 1 <0.0001
id 42.46 9 <0.0001
TOTAL 46.27 10 <0.0001
  • \(\chi^2\) for treatment (drug) is a bit larger than expected
  • Try mixed PO model with normal random effects for subjects
  • This did not converge and resulted in an unrealistically large test statistic
Code
require(ordinal)
f <- clmm2(factor(y) ~ drug, random = id, data=d, link='logistic', Hess=TRUE)
summary(f)
Cumulative Link Mixed Model fitted with the Laplace approximation

Call:
clmm2(location = factor(y) ~ drug, random = id, data = d, Hess = TRUE, 
    link = "logistic")

Random effects:
        Var  Std.Dev
id 9.965504 3.156819

Location coefficients:
      Estimate  Std. Error z value   Pr(>|z|)  
drugB    3.4639    0.0039   891.5393 < 2.22e-16

No scale coefficients

Threshold coefficients:
          Estimate  Std. Error z value  
-1.6|-1.2   -4.7115    1.6232    -2.9026
-1.2|-0.2   -3.4746    1.4104    -2.4636
-0.2|-0.1   -2.5543    1.3280    -1.9235
-0.1|0      -1.0962    1.1884    -0.9224
0|0.1       -0.5555    1.1367    -0.4887
0.1|0.7     -0.1081    1.1091    -0.0975
0.7|0.8      0.3318    1.0951     0.3030
0.8|1.1      1.3986    1.0786     1.2967
1.1|1.6      2.0828    1.0414     2.0000
1.6|1.9      2.7148    0.9830     2.7616
1.9|2        3.2579    0.9314     3.4979
2|3.4        3.8836    0.8593     4.5195
3.4|3.7      5.3998    0.0038  1433.6904
3.7|4.4      6.2450    0.0038  1660.5590
4.4|4.6      7.1126    0.7138     9.9650
4.6|5.5      8.4414    0.0039  2172.6229

log-likelihood: -50.12909 
AIC: 136.2582 
Condition number of Hessian: 909213.22 

clmm2 uses a Laplace approximation for numerical integration. Let’s try using 7-point quadrature integration instead.

Code
f <- clmm2(factor(y) ~ drug, random = id, data=d, link='logistic', Hess=TRUE,
           nAGQ=7)
summary(f)
Cumulative Link Mixed Model fitted with the adaptive Gauss-Hermite 
quadrature approximation with 7 quadrature points

Call:
clmm2(location = factor(y) ~ drug, random = id, data = d, Hess = TRUE, 
    link = "logistic", nAGQ = 7)

Random effects:
      Var  Std.Dev
id 9.7614 3.124324

Location coefficients:
      Estimate Std. Error z value Pr(>|z|) 
drugB  3.4093   1.2088     2.8205 0.0047944

No scale coefficients

Threshold coefficients:
          Estimate Std. Error z value
-1.6|-1.2 -4.6254   1.8980    -2.4370
-1.2|-0.2 -3.4202   1.6540    -2.0678
-0.2|-0.1 -2.5169   1.5306    -1.6443
-0.1|0    -1.0695   1.3617    -0.7854
0|0.1     -0.5393   1.3192    -0.4088
0.1|0.7   -0.1034   1.3011    -0.0795
0.7|0.8    0.3309   1.2972     0.2551
0.8|1.1    1.3813   1.3553     1.0192
1.1|1.6    2.0516   1.4343     1.4304
1.6|1.9    2.6785   1.5130     1.7703
1.9|2      3.2211   1.5777     2.0416
2|3.4      3.8351   1.6946     2.2631
3.4|3.7    5.3411   2.0913     2.5539
3.7|4.4    6.1718   2.2910     2.6940
4.4|4.6    7.0214   2.4485     2.8676
4.6|5.5    8.3565   2.8343     2.9483

log-likelihood: -49.85619 
AIC: 135.7124 
Condition number of Hessian: 525.3466 
  • This frequentist mixed model fit seems to be fine
  • A Bayesian mixed effects PO model did not have any trouble
  • Mimic a p-value by dividing the posterior mean log odds ratio by the posterior standard deviation of this \(\beta\)
  • Also get Bayesian posterior probability of an effect, which is more actionable than a p-value
Code
require(rmsb)
options(mc.cores = parallel::detectCores() - 1)   # use all but one CPU
cmdstanr::set_cmdstan_path('/usr/local/bin/cmdstan-2.34.1')

f <- blrm(y ~ drug + cluster(id), data=d, backend='cmdstan')
Running MCMC with 4 chains, at most 11 in parallel...

Chain 1 finished in 0.3 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.3 seconds.
Chain 4 finished in 0.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 0.5 seconds.
Code
f

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.148 for Intercepts

blrm(formula = y ~ drug + cluster(id), data = d, backend = "cmdstan")

Frequencies of Responses

-1.6 -1.2 -0.2 -0.1    0  0.1  0.7  0.8  1.1  1.6  1.9    2  3.4  3.7  4.4  4.6 
   1    1    1    2    1    1    1    2    1    1    1    1    2    1    1    1 
 5.5 
   1 
Mixed Calibration/
Discrimination Indexes
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 20 LOO log L -73.8±3.86 g 1.504 [0.608, 2.683] C 0.753 [0.753, 0.753]
Draws 4000 LOO IC 147.6±7.72 gp 0.018 [0, 0.071] Dxy 0.505 [0.505, 0.505]
Chains 4 Effective p 32.84±2.49 EV 0.016 [0, 0.065]
Time 0.9s B 0.048 [0.045, 0.05] v 2.425 [0.141, 6.019]
p 1 vp 0.001 [0, 0.005]
Cluster on id
Clusters 10
σγ 2.2124 [0.4341, 4.3747]
Mean β Median β S.E. Lower Upper Pr(β>0) Symmetry
 2.8650  2.8162  1.0696  0.8839  5.0272  0.9990  1.13
Code
draws <- f$draws[, 'drug=B']
beta <- mean(draws)
se   <- sd(draws)
z    <- beta / se
p    <- 2 * (1 - pnorm(abs(z)))
P    <- mean(draws > 0)
round(c(beta=beta, se=se, z=z, p=p, 'Pr(beta > 0)'=P), 3)
        beta           se            z            p Pr(beta > 0) 
       2.865        1.070        2.679        0.007        0.999 
  • Compare the p=0.007 with the rank difference test’s p=0.009

  • Note that \(\beta\) is on the log-odds scale and cannot be compared to a difference in means

  • Semiparametric models can be used to estimate means, quantiles, etc. (Section 7.8)

  • Try an ordinary PO model with a cluster sandwich covariance estimator

Code
f <- orm(y ~ drug, x=TRUE, y=TRUE)
robcov(f, id)

Logistic (Proportional Odds) Ordinal Regression Model

orm(formula = y ~ drug, x = TRUE, y = TRUE)

Frequencies of Responses

-1.6 -1.2 -0.2 -0.1    0  0.1  0.7  0.8  1.1  1.6  1.9    2  3.4  3.7  4.4  4.6 
   1    1    1    2    1    1    1    2    1    1    1    1    2    1    1    1 
 5.5 
   1 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 20 LR χ2 3.82 R2 0.174 ρ 0.425
Distinct Y 17 d.f. 1 R21,20 0.131
Cluster on id Pr(>χ2) 0.0507 R21,19.9 0.132
Clusters 10 Score χ2 3.74 |Pr(Y ≥ median)-½| 0.183
Y0.5 0.95 Pr(>χ2) 0.0531
max |∂log L/∂β| 4×10-5
β S.E. Wald Z Pr(>|Z|)
drug=B  1.5916  0.4831 3.29 0.0010
  • The result is reasonable
  • Simulations show that the cluster sandwich estimator of the standard error of the estimated effect has excellent operating characteristics

7.10 Power and Sample Size

7.10.1 Normal Approximation

  • Common to get power/sample size estimates for the Wilcoxon two-sample comparison using the unpaired \(t\)-test power formula
  • Are assuming normality and (usually) equal variances
  • To reflect the slight inefficiency of the Wilcoxon two-sample test if normality were to magically hold, divide the \(t\)-test sample size by the efficiency \(\frac{3}{\pi} = 0.955\), i.e., multiply it by \(\frac{\pi}{3} = 1.047\) (see the sketch at the end of this list)
  • When the response within-group distribution is far from normal this approach is suspect
    • e.g., \(Y\) has many ties at one value, has a floor or ceiling effect, is asymmetric, or has heavy tails
  • Need a general approach
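  • Before moving on, a minimal sketch of the normal-approximation calculation referenced above, using the pwr package; the 0.3 SD effect size and 0.9 power are illustrative assumptions:
Code
require(pwr)
# Per-group n for an unpaired t-test to detect a 0.3 SD difference with power 0.9
n.t <- pwr.t.test(d=0.3, power=0.9, sig.level=0.05, type='two.sample')$n
# Inflate by pi/3 = 1.047 to approximate the Wilcoxon requirement under normality
round(c('t n/group'=n.t, 'Wilcoxon n/group'=n.t * pi / 3), 1)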

7.10.2 More on Relative Efficiency

  • Relative efficiency of \(\frac{3}{\pi}\) for the Wilcoxon 2-sample test can be derived as a correlation coefficient
  • As \(n \rightarrow \infty\) it is the squared correlation between the weights Wilcoxon gives to order statistics (sorted data values) and the optimal weights
  • Wilcoxon is a linear rank statistic with weights equal to ordinary ranks
  • Optimal linear rank test (normal scores test) for a normal distribution uses the probit link (normal inverse weights), i.e., \(\Phi^{-1}(\frac{\mathrm{ranks}}{n+1})\)
  • Compute correlation of ordinary ranks with normal scores
Code
for(n in c(10, 100, 1000, 10000, 100000, 1000000)) {
  ranks  <- 1 : n
  zranks <- qnorm(ranks / (n + 1))
  cat('n:', n, '  r2:', cor(ranks, zranks)^2, '\n')
}
n: 10   r2: 0.9923288 
n: 100   r2: 0.9688625 
n: 1000   r2: 0.958053 
n: 10000   r2: 0.9554628 
n: 1e+05   r2: 0.955008 
n: 1e+06   r2: 0.9549402 
Code
cat('3/pi: ', 3 / pi, '\n')
3/pi:  0.9549297 

7.10.3 Tailoring Power Calculations to the Wilcoxon Test

  • Whitehead (1993) derived simple formulas related to the proportional odds model score test
  • Formulas assume that a frequency-tabulated distribution estimate is available for the combined groups
  • Power is computed as a function of the group 2 : group 1 odds ratio for exceedance probabilities
  • See example below for conversion of ORs to differences in means or medians
    • OR=1 \(\rightarrow\) distributions are the same, so differences in means/medians are zero
  • See R Hmisc package popower and posamsize functions

7.10.4 Discrete \(Y\)

  • Example: response variable has clumping at zero (with prob. 0.3) and is otherwise uniformly distributed over the values 1, 2, 4, 8, 16, 32, 64
    • note: actual data values do not affect power calculations
    • they don’t come into play until ORs are translated to means/medians
Code
p <- c(.3, rep(.1, 7))
popower(p, 1.25, 1000)  # compute power to detect OR=1.25, combined N=1000
Power: 0.516 
Efficiency of design compared with continuous response: 0.966 
Approximate standard error of log odds ratio: 0.1116 
Code
posamsize(p, 1.25, power=0.9)  # N for power=0.9
Total sample size: 2621.4 
Efficiency of design compared with continuous response: 0.966 
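  • As a check (not in the original), the total sample size above closely matches Whitehead’s formula for equal group sizes, \(N = \frac{12\,(z_{1-\alpha/2} + z_{1-\beta})^{2}}{(\log \mathrm{OR})^{2}\,(1 - \sum \bar{p}_i^{3})}\), where \(\bar{p}_i\) are the combined-group cell probabilities:
Code
z <- qnorm(0.975) + qnorm(0.9)                # z_{1-alpha/2} + z_{1-beta}
12 * z^2 / (log(1.25)^2 * (1 - sum(p^3)))     # ~ 2621, agreeing with posamsize above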
  • Show how cell probabilities are translated by OR=1.25, and compute the mean and median of Y for a series of ORs for simpler interpretation
Code
pomodm(p=p, odds.ratio=1.25)
[1] 0.25531915 0.09250694 0.09661836 0.10101010 0.10570825 0.11074197 0.11614402
[8] 0.12195122
Code
x <- c(0, 2 ^ (0:6))
sum(p * x)  # check mean with OR=1
[1] 12.7
Code
ors <- c(1, 1.05, 1.1, 1.2, 1.25, 1.5, 2)
w <- matrix(NA, nrow=length(ors), ncol=2,
            dimnames=list(OR=ors, c('mean', 'median')))
i <- 0
for(or in ors) {
  i <- i + 1
  w[i, ] <- pomodm(x, p, odds.ratio=or)
}
w
      
OR         mean   median
  1    12.70000 3.000000
  1.05 13.14602 3.364286
  1.1  13.58143 3.709091
  1.2  14.42238 4.350000
  1.25 14.82881 4.650000
  1.5  16.73559 6.000000
  2    20.03640 9.900000

7.10.5 Gaussian \(Y\)

  • Suppose response variable for control group has a normal distribution with mean 100 and SD 10
  • Start by assuming the experimental arm has the same distribution as control except with the mean shifted upwards 3 units
  • This will result in non-proportional odds so the Wilcoxon test is not optimal but will still be 0.95 efficient
  • When the sample size per group is 150, the power of the \(t\)-test to detect a 3-unit difference in means is:
Code
require(pwr)
pwr.t.test(d=3 / 10, n=150, sig.level=0.05, type='two.sample')

     Two-sample t test power calculation 

              n = 150
              d = 0.3
      sig.level = 0.05
          power = 0.7355674
    alternative = two.sided

NOTE: n is number in *each* group
  • To get the power of the Wilcoxon test when both populations have a normal distribution, we can easily use simulation
Code
getRs('hashCheck.r')   # defines runifChanged
s <- 1000    # number of simulated trials
pval <- numeric(s)
set.seed(1)  # so can reproduce results
for(i in 1 : s) {
  y1 <- rnorm(150, 100, 10)
  y2 <- rnorm(150, 103, 10)
  w  <- wilcox.test(y1, y2)
  pval[i] <- w$p.value
  }
mean(pval < 0.05)   # proportion of simulations with p < 0.05
[1] 0.713
Code
# Simulate the power by actually running the prop. odds model 300 times
g <- function() simRegOrd(300, nsim=400, delta=3, sigma=10)$power    # slower
as.vector(runifChanged(g))
[1] 0.71
  • For the Wilcoxon test to be optimal (i.e., for PO to hold), shifting the control distribution by an odds ratio will result in a non-Gaussian distribution for the experimental arm
  • Solve for the odds ratio that shifts the mean from 100 to 103, assume PO and compute the power
Code
# Use an arbitrarily large sample to mimic population computations
m   <- 200000
y1  <- sort(rnorm(m, 100, 10))
ors <- means <- seq(1, 4, by=.025)
i   <- 0
for(or in ors) {
  i <- i + 1
  means[i] <- pomodm(y1, rep(1/m, m), odds.ratio=or)['mean']
}
needed.or <- approx(means, ors, xout=103)$y
needed.or
[1] 1.706672
Code
ggplot(mapping=aes(ors, means)) + geom_line() +
  geom_hline(yintercept=103, alpha=I(0.25)) +
    geom_vline(xintercept=needed.or, alpha=I(0.25)) +
  xlab('Group B:A Odds Ratio') +
    ylab('Mean in Population B')
# Compute power at that odds ratio assuming no ties in data
popower(rep(1/300, 300), odds.ratio=needed.or, n=300)
Power: 0.759 
Efficiency of design compared with continuous response: 1 
Approximate standard error of log odds ratio: 0.2007 
Figure 7.8: Relationship between odds ratio and assumed mean in the experimental arm
  • Check how non-normal the experimental arm responses would be if PO holds and OR=10
Code
# First do this theoretically
# Control arm has Gaussian Y with mean 100, SD 10
# Create experimental arm distribution using OR=10
y    <- seq(60, 145, length=150)
Fy   <- 1 - pnorm(y, mean=100, sd=10)    # P(Y >= y | group A)
Gy   <- 1 - plogis(qlogis(Fy) + log(10)) # P(Y < y | group B) under PO, OR=10
qu   <- approx(Gy, y, xout=c(0.25, 0.5, 0.75))$y
qu    # Q1, median, Q3
[1] 107.3628 113.3506 118.4894
Code
s    <- (qu[3] - qu[1]) / (qnorm(0.75) - qnorm(0.25))
mu   <- qu[1] - s * qnorm(0.25)
ncdf <- pnorm(y, mean=mu, sd=s)
d    <- rbind(data.frame(Distribution='logistic', x=y, y=Gy),
              data.frame(Distribution='normal',   x=y, y=ncdf))
ggplot(d, aes(x, y, col=Distribution)) + geom_line() +
  xlab(expression(y)) + ylab(expression(P(Y <= y)))
Figure 7.9: Assessing departure from normality in the experimental arm that is induced by assuming proportional odds instead of a shift in means. The nearly equivalent normal distribution was determined by matching quartiles of the distribution to the logistic distribution.
Code
ggplot(mapping=aes(x=y, y=qnorm(Gy))) + geom_line() +
  geom_smooth(method=lm, se=FALSE, col=I('blue')) +
    xlab(expression(y))
Figure 7.10: Theoretical \(q-q\) plot assessing normality of experimental arm distribution induced by proportional odds
Code
p <- pomodm(p=rep(1/m, m), odds.ratio=10)
range(p)  # control arm: all 1/200000
[1] 5.000023e-07 4.999773e-05
Code
wtd.mean(y1, p)  # mean shifted by about 12 units
[1] 112.4071
Code
# Form new distribution by repeating each observation a number
# of times equal to the ratio of the new probability to the
# minimum of all new probabilities
y2 <- rep(y1, round(p / min(p)))  # 2M obs instead of 200K
mean(y2)
[1] 112.4658
Code
quantile(y2, c(.25, .5, .75))
     25%      50%      75% 
107.4406 113.3427 118.4725 
Code
# The following plot is similar to the previous one
ncdf <- pnorm(y, mean=mean(y2), sd=sd(y2))
ggplot(mapping=aes(x=y2)) + stat_ecdf(geom='step') +
  geom_line(mapping=aes(x=y, y=ncdf, col=I('blue'))) +
    xlab(expression(y)) + ylab(expression(Proportion <= x))
Figure 7.11: Creation of 200,000 point discrete distribution for the experimental arm, obtained by converting the control distribution under proportional odds. Blue curve is the normal cumulative distribution function; black curve is for the experimental arm assuming a proportional odds shift from the control arm.
  • Little non-normality of the 2nd group if the treatment effect operates by multiplying the odds that \(Y \geq y\) instead of incrementing the mean
    • relates to the similarity of the normal and logistic distributions (quantified in the sketch below)
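  • A minimal sketch quantifying that similarity (the standard logistic distribution has SD \(\pi/\sqrt{3}\)):
Code
# Compare the standard logistic CDF with a normal CDF having the same SD
x   <- seq(-4, 4, length=401)
dif <- plogis(x) - pnorm(x, sd=pi / sqrt(3))
max(abs(dif))   # maximum absolute discrepancy, roughly 0.02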

7.10.6 Heavy-tailed \(Y\)

  • Get power to detect a shift in mean of 0.3 units for a heavy-tailed control distribution (t with 4 d.f.) with 150 subjects per group
  • Loss of efficiency of \(t\)-test
    • mean and SD are no longer optimal data summaries
  • Can use above method to compute Wilcoxon power quickly if willing to assume PO
  • Let’s not assume PO, and instead use simulation
  • Compare with power of \(t\)-test
  • Do for both null and non-null cases to verify type I error prob.
Code
s <- 1000    # number of simulated trials
pvalt <- pvalw <- numeric(s)
set.seed(1)  # so can reproduce results
for(delta in c(0, 0.3)) {
  for(i in 1 : s) {
    y1 <- rt(150, 4)
    y2 <- rt(150, 4) + delta
    pvalt[i] <- t.test(y1, y2)$p.value
    pvalw[i] <- wilcox.test(y1, y2)$p.value
  }
  cat('Delta:', delta, '\n')
  P <- function(x) round(mean(x), 2)
  cat('Proportion of simulations with W p-value < t p-value:',
      P(pvalw < pvalt), '\n')
  cat('Mean p-value for t:', P(pvalt), '\n')
  cat('Mean p-value for W:', P(pvalw), '\n')
  cat('Power for t:', P(pvalt < 0.05), '\n')
  cat('Power for W:', P(pvalw < 0.05), '\n\n')
}
Delta: 0 
Proportion of simulations with W p-value < t p-value: 0.51 
Mean p-value for t: 0.5 
Mean p-value for W: 0.49 
Power for t: 0.05 
Power for W: 0.06 

Delta: 0.3 
Proportion of simulations with W p-value < t p-value: 0.73 
Mean p-value for t: 0.17 
Mean p-value for W: 0.12 
Power for t: 0.47 
Power for W: 0.6 
  • Hmisc simRegOrd function can also simulate power for an adjusted two-sample comparison if there is one adjustment covariate

7.11 Bayesian Proportional Odds Model

  • PO model and other cumulative probability semiparametric ordinal regression models are readily extended to a Bayesian framework
  • Need special care in selecting priors for the intercepts for the continuous \(Y\) case
  • Nathan James of Vanderbilt University has an implementation using Stan available at
    github.com/ntjames/bayes_cpm
  • See also the R brms package: bit.ly/brms-ordinal and this discussion:
    github.com/paul-buerkner/brms/issues/762
  • Bayesian version of the Wilcoxon test is the posterior probability that \(\beta_1 > 0\) in the PO model
  • Advantages of Bayes for PO models:
    • does not need approximations such as large sample normality of \(\hat{\beta}\) or \(\chi^{2}\) distribution approximation to likelihood ratio test statistic
    • inference is more interpretable and direct
    • can bring outside information to the analysis
    • can incorporate shrinkage/penalization/skepticism and still have exact inference
    • automatically obtain exact distributions and credible intervals for derived quantities10, e.g. mean, quantiles, differences in means and quantiles, differences in exceedance probs, \(P(Y=y | X)\)
    • can relax PO assumption without huge instabilities that result from using polytomous logistic models; prior distributions can favor PO while allowing non-PO

10 One merely takes each posterior sample for the \(\alpha\)s and \(\beta\)s and computes the quantity of interest, thereby automatically generating posterior samples for the derived quantity, from which quantiles yield credible intervals, etc.
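A minimal sketch of the footnote’s recipe, assuming the earlier Bayesian fit was saved under the (hypothetical) name fb before f was reused for the frequentist model: each posterior draw of the treatment \(\beta\) becomes a draw of any derived quantity, here the odds ratio, from which a credible interval is read off directly.

Code
draws <- fb$draws[, 'drug=B']         # posterior samples of the drug log odds ratio
or    <- exp(draws)                   # derived quantity: the odds ratio
quantile(or, c(0.025, 0.5, 0.975))    # posterior median and 0.95 credible interval
mean(or > 1)                          # Pr(OR > 1), identical to Pr(beta > 0)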

7.12 Sample Size Requirement for Characterizing Entire Distributions

The empirical cumulative distribution function (ECDF) was introduced in Section 4.3.3.2. It is a nonparametric estimator of the unknown CDF for a (primarily) continuous variable. Inverting the ECDF gives traditional sample quantile estimates, so if one wants to estimate all possible quantiles accurately one needs to estimate the ECDF accurately over its entire range.
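As a minimal illustration (arbitrary simulated data), the type=1 sample quantile in R is exactly the inverse of the ECDF:

Code
set.seed(1)
x  <- rnorm(50)
Fn <- ecdf(x)                      # empirical cumulative distribution function
q1 <- quantile(x, 0.25, type=1)    # inverse-ECDF definition of the 0.25 quantile
Fn(q1)                             # smallest ECDF value >= 0.25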

Kolmogorov and Smirnov developed a nonparametric test for whether a sample comes from a specific parametric CDF, by computing the maximum absolute difference between the ECDF and the postulated CDF. This test can be inverted to obtain simultaneous confidence bands for the unknown CDF. The global margin of error for an ECDF is the Kolmogorov-Smirnov critical value, for a certain \(\alpha\).

Kjetil Halvorsen and Martin Maechler have an R function in their sfsmisc package that provides a fast approximation to the K-S critical values when \(\alpha=0.05\) (for 0.95 confidence bands). This function is reproduced below, and we use it to compute the 0.95-level simultaneous margins of error for the ECDF as a function of sample size.

This is a good opportunity to check the universal rule that the precision of an estimate is proportional to the square root of the sample size. The approxmoe function below multiplies the accurate margin of error at \(n=20\) by the square root of the ratio of 20 to the sample size of interest.

Code
KSd <- function(n)
{
    ## `approx.ksD()'
    ## approximations for the critical level for Kolmogorov-Smirnov statistic D,
    ## for confidence level 0.95. Taken from Bickel & Doksum, table IX, p.483
    ## and Lienert G.A.(1975) who attributes to Miller,L.H.(1956), JASA
    ifelse(n > 80,
           1.358/( sqrt(n) + .12 + .11/sqrt(n)),## Bickel&Doksum, table IX,p.483

           splinefun(c(1:9, 10, 15, 10 * 2:8),# from Lienert
                     c(.975,   .84189, .70760, .62394, .56328,# 1:5
                       .51926, .48342, .45427, .43001, .40925,# 6:10
                       .33760, .29408, .24170, .21012,# 15,20,30,40
                       .18841, .17231, .15975, .14960)) (n))
}
plot(10 : 10000, KSd(10 : 10000), type='l', xlab='n',
     ylab='ECDF Margin of Error')
abline(h=c(0.025, 0.05, 0.1), col=gray(0.6))
ns <- c(20, 50, 100, 180, 250, 500, 730, 750, 1000, 2935, 5000, 10000, 18400)
approxmoe <- function(n) 0.29408 * sqrt(20 / n)
data.frame(n=ns, MOE=round(KSd(ns), 3), MOEsqrt=round(approxmoe(ns), 3))
       n   MOE MOEsqrt
1     20 0.294   0.294
2     50 0.188   0.186
3    100 0.134   0.132
4    180 0.100   0.098
5    250 0.085   0.083
6    500 0.060   0.059
7    730 0.050   0.049
8    750 0.049   0.048
9   1000 0.043   0.042
10  2935 0.025   0.024
11  5000 0.019   0.019
12 10000 0.014   0.013
13 18400 0.010   0.010
Figure 7.12: Margin of error for an ECDF based on Kolmogorov-Smirnov critical values

The sample size required to have a margin of error of \(\pm 0.01\) for the entire ECDF is 18,400. In other words, if one desires high confidence that no estimated quantile will be off by more than 0.01 on the probability scale, \(n=18,400\) will achieve this. For example, the sample 0.25 quantile (first quartile) may correspond to population quantiles 0.24-0.26. To achieve a \(\pm 0.1\) MOE requires \(n=180\), and to achieve \(\pm 0.05\) requires \(n=730\).

The “universal \(\sqrt{}\) rule” worked quite well. This will allow us to extend the calculations to any confidence level. The built-in R function qsmirnov computes exact critical values of the two-sample Smirnov test; the one-sample K-S critical value is obtained by letting one of the two samples be effectively infinite. Here \(n=10000\) is close enough to \(\infty\).

Code
qsmirnov(0.95, c(20, 10000))   # took 11s
[1] 0.2943706
Code
KSd(20)
[1] 0.29408
Code
# Now use alpha=0.1 instead of 0.05
qsmirnov(0.9, c(20, 10000))
[1] 0.2650735
Code
approxmoe90 <- function(n) 0.2652735 * sqrt(20 / n)
round(approxmoe90(c(20, 100, 1000)), 3)
[1] 0.265 0.119 0.038