8  Correlation and Nonparametric Regression

8.1 Overview

| Outcome     | Predictor   | Normality? | Linearity? | Analysis Method                                    |
|-------------|-------------|------------|------------|----------------------------------------------------|
| Interval    | Binary      | Yes        |            | 2-sample \(t\)-test or linear regression           |
| Ordinal     | Binary      | No         |            | Wilcoxon 2-sample test or proportional odds model  |
| Categorical | Categorical |            |            | Pearson \(\chi^2\) test                            |
| Interval    | Interval    | Yes        | Yes        | Correlation or linear regression                   |
| Ordinal     | Ordinal     | No         | No         | Spearman’s rank correlation                        |
  • Examine association between continuous/interval outcome (\(y\)) and continuous/interval predictor (\(x\))
  • Scatterplot of \(y\) versus \(x\)
  • Nonparametric smoothers, including local linear regression, the “super smoother”, and overlapping moving windows for any statistic of interest

8.2 Pearson’s correlation coefficient

  • \(r = \frac{\Sigma(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\Sigma(x_i - \bar{x})^2\Sigma(y_i - \bar{y})^2}}\)

  • Range: \(-1 \leq r \leq 1\)

  • Correlation coefficient is a unitless index of strength of association between two variables (+ = positive association, - = negative, 0 = no association)

  • Measures the linear relationship between \(X\) and \(Y\)

  • Can test for significant association by testing whether the population correlation is zero \[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\] which has a \(t\) distribution with \(n-2\) d.f. when the population correlation is zero, and is numerically identical to the \(t\)-test of zero slope in the simple linear regression of \(Y\) on \(X\).

  • Use probability calculator for \(t\) distribution to get \(P\)-value (2-tailed if interested in association in either direction)

  • A 1-tailed test for a positive correlation between \(X\) and \(Y\) addresses the question: when \(X \uparrow\) does \(Y \uparrow\) in the population?

  • Confidence intervals for population \(r\) calculated using Fisher’s \(Z\) transformation \[Z = \frac{1}{2} \mathrm{log}_\mathrm{e} \left( \frac{1+r}{1-r} \right)\]

    • For large \(n\), Z follows a Normal distribution with standard error \(\frac{1}{\sqrt{n-3}}\)
    • To calculate a confidence interval for \(r\), first find the confidence interval for \(Z\) then transform back to the \(r\) scale \[\begin{array}{rcl} Z & = & \frac{1}{2} \mathrm{log}_\mathrm{e} \left( \frac{1+r}{1-r} \right) \\ 2Z & = & \mathrm{log}_\mathrm{e} \left( \frac{1+r}{1-r} \right) \\ \mathrm{exp}(2Z) & = & \frac{1+r}{1-r} \\ \mathrm{exp}(2Z)(1-r) & = & 1 + r \\ \mathrm{exp}(2Z) - r\,\mathrm{exp}(2Z) & = & 1 + r \\ \mathrm{exp}(2Z) - 1 & = & r\,\mathrm{exp}(2Z) + r \\ \mathrm{exp}(2Z) - 1 & = & r \left(\mathrm{exp}(2Z) + 1\right) \\ \frac{\mathrm{exp}(2Z) - 1}{\mathrm{exp}(2Z) + 1} & = & r \\ \end{array}\]
  • Example (Altman 89-90): Pearson’s \(r\) for a study investigating the association of basal metabolic rate with total energy expenditure was calculated to be \(0.7283\) in a study of \(13\) women. Derive a 0.95 confidence interval for \(r\). \[ Z = \frac{1}{2} \mathrm{log}_\mathrm{e} \left( \frac{1+0.7283}{1-0.7283} \right) = 0.9251 \]

The lower limit of a 0.95 CI for \(Z\) is given by \[ 0.9251 - 1.96 \times \frac{1}{\sqrt{13-3}} = 0.3053 \] and the upper limit is \[ 0.9251 + 1.96 \times \frac{1}{\sqrt{13-3}} = 1.545 \] A 0.95 CI for the population correlation coefficient is given by transforming these limits from the \(Z\) scale back to the \(r\) scale \[ \frac{\mathrm{exp}(2\times 0.3053) - 1}{\mathrm{exp}(2\times 0.3053) + 1} ~~~ \mathrm{to} ~~~ \frac{\mathrm{exp}(2\times 1.545) - 1}{\mathrm{exp}(2\times 1.545) + 1} \] which gives a 0.95 CI from 0.30 to 0.91 for the population correlation.

Code
n <- 13
r <- 0.7283
z.transform <- 0.5 * log((1 + r) / (1 - r))
clz <- z.transform + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
clr <- (exp(2 * clz) - 1) / (exp(2 * clz) + 1)
round(c(z.transform, clz, clr), 4)
[1] 0.9251 0.3053 1.5449 0.2962 0.9129
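
As a check on the \(t\)-statistic formula above, here is a minimal sketch of the test of \(H_{0}: \rho = 0\) for the same example, reusing the r and n objects defined in the code above:

Code
# Sketch: t statistic and 2-tailed P-value for H0: rho = 0,
# using r and n from the preceding code block
tstat <- r * sqrt(n - 2) / sqrt(1 - r ^ 2)
pval  <- 2 * pt(-abs(tstat), df = n - 2)
round(c(t = tstat, P = pval), 4)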

8.3 Spearman’s Rank Correlation

  • Pearson’s \(r\) assumes linear relationship between \(X\) and \(Y\)

  • Spearman’s \(\rho\) (sometimes labeled \(r_{s}\)) assumes monotonic relationship between \(X\) and \(Y\)

    • when \(X \uparrow\) \(Y\) always \(\uparrow\) or stays flat, or \(Y\) always \(\downarrow\) or stays flat
    • does not assume linearity
  • \(\rho = r\) computed after replacing the column of \(X\)s by their ranks and the column of \(Y\)s by their ranks (see the sketch after this list)

  • To test \(H_{0}:\rho=0\) without assuming linearity or normality, being damaged by outliers, or sacrificing much power (even if data are normal), use a \(t\) statistic: \[t = \frac{\rho\sqrt{n-2}}{\sqrt{1-\rho^{2}}}\] which is identical to the \(t\)-test used to test whether the population \(r\) is zero; d.f.=\(n-2\).

  • Use probability calculator for \(t\) distribution to get \(P\)-value (2-tailed if interested in association in either direction)

  • A 1-tailed test for a positive correlation between \(X\) and \(Y\) addresses the question: when \(X \uparrow\) does \(Y \uparrow\) in the population?
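
A minimal sketch with simulated (hypothetical) data, confirming that Spearman’s \(\rho\) is just Pearson’s \(r\) computed on the ranks:

Code
# Sketch with simulated data: Spearman's rho equals Pearson's r on ranks
set.seed(1)
x <- rnorm(30)
y <- x ^ 3 + rnorm(30)   # monotonic but nonlinear relationship
c(pearson.on.ranks = cor(rank(x), rank(y)),
  spearman         = cor(x, y, method = 'spearman'))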

8.4 Correlation Examples

  • Correlation difficult to judge by eye
  • Example plots on following pages
Code
# Generate 50 data points with Population correlations of 0, .2, .4, .6,
# .8, and .9 and plot results
require(ggplot2)
n <- 50
set.seed(123)
x <- rnorm(n, 5, 1)
d <- expand.grid(x=x, R=c(0, .2, .4, .6, .8, .9))
d <- transform(d, y = x + rnorm(nrow(d), 0,
                      ifelse(R == 0, 5, sqrt(R ^ -2 - 1))))
sfun <- function(i) {
  x <- d$x[i]; y <- d$y[i]; R <- d$R[i][1]
  r <- cor(x, y)
  tr <- r * sqrt(n - 2) / sqrt(1 - r^2)
  rho <- cor(rank(x), rank(y))
  trho <- rho * sqrt(n - 2) / sqrt(1 - rho^2)
  label <- paste('True r:', R[1], '  r:', round(r,2), '  t:', round(tr,2),
        '  rho:', round(rho,2), '  t:', round(trho,2), sep='')
  names(label) <- R
  label
  }
stats <- tapply(1 : nrow(d), d$R, sfun)
d$stats <- factor(stats[as.character(d$R)], unique(stats))

ggplot(d, aes(x=x, y=y)) + geom_point() + facet_wrap(~ stats) +
  theme(strip.text.x = element_text(size=7))
Figure 8.1: Samples of size \(n=50\) for X and Y are drawn from bivariate normal populations with true correlations ranging from 0.0 to 0.9. Pearson and Spearman sample correlations are shown for samples of size 50. Besides the population correlation coefficient, each panel is labeled with the estimated Pearson \(r\), its \(t\) statistic, the estimated Spearman \(\rho\), and its \(t\) statistic
Code
# Different scenarios that can lead to a correlation of 0.7
spar(mfrow=c(3,2))   # spar: plot-layout helper defined in the book's setup code
set.seed(123)
rho <- 0.7; n <- 50
var.eps <- rho^-2 - 1
x <- rnorm(n, 5, 1)
y <- x + rnorm(n, 0, sqrt(var.eps))
cor(x,y)
[1] 0.6951673
Code
plot(x,y,xlab='',ylab='')

x <- c(1:20,30)
y <- c(1:20,6.2)
cor(x,y)
[1] 0.6988119
Code
plot(x,y,xlab='',ylab='')

set.seed(123)
x <- rnorm(40)
y <- rnorm(40)
x[21] <- y[21] <- 8.5
cor(x,y)
[1] 0.7014825
Code
plot(x,y,xlab='',ylab='')

x <- rep(0:19,2)
y <- c(rep(.62,20),rep(2,20)) * x
cor(x,y)
[1] 0.701783
Code
plot(x,y,xlab='',ylab='')

x <- -7:12
y <- x^2
cor(x,y)
[1] 0.6974104
Code
plot(x,y,xlab='',ylab='')

set.seed(123)
tmp <- 1:20 / 2
x <- c(rnorm(20, tmp, 1), tmp + rnorm(20,14.5,1))
y <- c(rnorm(20, -tmp, 1), -tmp + rnorm(20,14.5,1))
cor(x,y)
[1] 0.703308
Code
plot(x,y,xlab='',ylab='')
Figure 8.2: Different observed datasets that have the same correlation. All six plots have a sample Pearson correlation of \(0.7\).

8.5 Correlation and Agreement

  • Compare two methods of measuring the same underlying value
    • Lung function measured using a spirometer (expensive, accurate) or peak flow meter (cheap, less accurate)
    • Two devices (oropharyngeal and conventional) used to measure acidity (pH) in the esophagus as a marker of reflux
  • Typical (incorrect) approach begins with scatterplot of one method vs. the other with a 1:1 line indicating perfect agreement
  • See the scatterplot of the two devices’ pH measurements (Figure fig-descript-ph)
  • Incorrect approach would report a high correlation (\(r = 0.90\)) and conclude good agreement
  • Problems with the correlation approach
  1. \(r\) measures the degree of linear association between two variables, not the agreement. If, for example, the Sandhill consistently gave pH values that were 0.5 unit higher than the Restech, we could still have high correlation, but poor agreement between the two devices. We can have high correlation if the two devices lie closely to any line, not just a 1:1 line that indicates perfect agreement.
  2. A change in scale does not affect correlation, but does influence agreement. For example, if the Sandhill always registered 2 times larger than the Restech, we would have perfect correlation but the agreement would get progressively worse for larger values of pH.
  3. Correlation depends on the range of the data, so that larger ranges lead to larger correlations. This can lead to very strange interpretations.
Pearson (\(r\)) and Spearman (\(\rho\)) correlations for Restech and Sandhill pH data. The correlation calculated using all of the data is larger than the correlation calculated using a restricted range of the data. However, it would be difficult to claim that the overall agreement is better than both the agreement when pH is less than 4 and when pH is greater than 4.
|                    | \(r\) | \(\rho\) |
|--------------------|-------|----------|
| all data           | 0.90  | 0.73     |
| avg pH \(\leq 4\)  | 0.51  | 0.58     |
| avg pH \(> 4\)     | 0.74  | 0.65     |
  4. Tests of significance (testing if \(r=0\)) are irrelevant to the question at hand, but often reported to demonstrate a significant association. The two devices are measuring the same quantity, so it would be shocking if we did not observe a highly significant \(p\)-value. A \(p < .0001\) is not impressive. A regression analysis with a highly significant slope would be similarly unimpressive.
  5. Data can have high correlation, but poor agreement. There are many examples in the literature, but even in our analysis with \(r = 0.90\), the correlation is high, but we will show that the agreement is not as good as the high correlation implies.
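
A minimal simulated sketch (hypothetical numbers) of points 2 and 5: a device that reads about twice as high as another yields near-perfect correlation yet poor agreement.

Code
# Sketch with simulated data: high correlation despite poor agreement
set.seed(1)
true <- runif(100, 1, 7)                # hypothetical true pH values
a    <- true + rnorm(100, 0, 0.2)       # device A
b    <- 2 * true + rnorm(100, 0, 0.2)   # device B reads about twice as high
c(r = cor(a, b), mean.abs.diff = mean(abs(a - b)))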

See Chapter 16 for simple approaches to assessing agreement and analyzing observer variability studies.

8.5.1 Bland-Altman Plots

  • See Bland and Altman (1986, Lancet)
  • Earlier: Tukey mean-difference plot
  • Create plots of the difference in measurements on the y-axis versus the average value of the two devices on the x-axis
  • If the two devices agree, the difference should be about zero
  • The average of the two devices is our best estimate of the true, unknown (pH) value that we are trying to measure
  • Measurements will often vary in a systematic way over the range of measurement. By plotting the difference versus the average, we can visually determine if the difference changes over our estimate of the truth.
  • The solid line indicates the mean difference; the dashed lines are approximate 0.95 limits of agreement (mean \(\pm\) 1.96 SD, assuming Normality)

But there is controversy about what should be on the \(x\)-axis of the plot. Krouwer (2008) concluded that:

  • When the two measures have nearly equal variability, i.e., when comparing two “field measurements”, the Bland-Altman approach is preferred
  • When one measurement is a “reference standard” having much less variation than the field measurement, the reference standard and not the average of the two measurements should be on the \(x\)-axis
Code
require(Hmisc)
getHdata(esopH)
esopH$diff <- with(esopH, orophar - conv)
ggplot(esopH, aes(x=(conv + orophar)/2, y=diff)) +
  stat_binhex(aes(fill=Hmisc::cut2(..count.., g=20)),
              bins=80) +
  stat_smooth() +
  geom_hline(yintercept = mean(esopH$diff, na.rm=TRUE) +
   c(-1.96, 0, 1.96) * sd(esopH$diff, na.rm=TRUE),
   linetype=c(2,1,2), color='brown') +
  xlab('Average of Conventional and Oropharyngeal pH') +
  ylab('Oropharyngeal Minus Conventional pH') +
  guides(fill=guide_legend(title='Frequency'))
Figure 8.3: Bland-Altman plot for the oroesophageal and conventional pH measurements, using hexagonal binning because of the large sample size. The difference in pH measurements (oro. - conventional) is presented on the \(y\)-axis and the average of the two devices on the \(x\)-axis. We see poor agreement around pH values of 4-5
  • We will also consider differences in the two measurements over the time of day
  • The added smooth curve is called a locally weighted scatterplot smooth (loess)
Code
getHdata(esopH2)
ggplot(esopH2, aes(x=time, y=diffpH)) +
       geom_point(pch='.') + stat_smooth() +
       geom_hline(yintercept = 0, col='gray60') +
       scale_x_continuous(breaks=seq(16, 38, by=4),
                           labels=c("4 PM", "8 PM", "12 AM",
                             "4 AM", "8 AM", "12 PM"),
                          limits=c(14, 14+24)) +
       ylab('Average of Oropharyngeal Minus Conventional pH') +
       xlab('Time of Day')
Figure 8.4: Difference in pH measurements (oro. - conventional) by time of day along with a loess smoother and pointwise 0.95 confidence bands. Is the difference modified by a subject being in a supine position rather than being upright?

8.5.2 Sample Size for \(r\)

  • Without knowledge of population variances, etc., \(r\) can be useful for planning studies
  • Choose \(n\) so that the margin for error (half-width of the confidence interval) for \(r\) is acceptable
  • Precision of \(r\) in estimating \(\rho\) is generally worst when \(\rho=0\)
  • This margin for error, along with those for three other choices of the unknown true \(\rho\), is shown in Figure 8.5.
Code
require(Hmisc)
plotCorrPrecision(rho=c(0, .25, .5, .75),
                  n=seq(10, 1000, length=100),
                  ylim=c(0, .4), col=1:4, opts=list(keys='lines'))
abline(h=seq(0, .4, by=0.025),
       v=seq(25, 975, by=25), col=gray(.9))
Figure 8.5: Margin for error (length of longer side of asymmetric 0.95 confidence interval) for \(r\) in estimating \(\rho\), when \(\rho=0, 0.25, 0.5, 0.75\). Calculations are based on Fisher \(z\) transformation of \(r\).

See also stats.stackexchange.com/questions/415131.

Another way to compute the needed sample size is to solve for \(n\) such that the probability that the observed correlation is in the right direction is high. Again using Fisher’s \(z\)-transformation \(z(r) = 0.5 \log{\frac{1 + r}{1 - r}}\), and letting \(r\) and \(\rho\) denote the sample and population correlation coefficients, respectively, \(z(r)\) is approximately normally distributed with mean \(z(\rho)\) and standard deviation \(\frac{1}{\sqrt{n-3}}\). The probability that \(r > 0\) is the same as \(\Pr(z(r) > 0)\), which equals \(\Pr(z(r) - z(\rho) > -z(\rho))\) = \(\Pr(\sqrt{n-3} z(r) - \sqrt{n-3} z(\rho) > -\sqrt{n-3} z(\rho))\) = \(\Phi(\sqrt{n-3}z(\rho))\), where \(\Phi\) is the \(N(0,1)\) cumulative distribution function. If we desire a probability of 0.95 that \(r\) has the right sign, we set the right-hand side to 0.95 and solve for \(n\):

\[n = (\frac{\Phi^{-1}(0.95)}{z(\rho)})^{2} + 3\]

When the true correlation \(\rho\) is close to 0.0 it makes sense that the sample size required to be confident that \(r\) is in the right direction is very large, and the direction will be very easy to determine if \(\rho\) is far from zero. So we are most interested in \(n\) when \(\rho \in [0.1, 0.5]\). The approximate sample sizes required for various \(\rho\) are shown in the graph below.

Code
z   <- function(x) 0.5 * log((1 + x) / (1 - x))
rho <- seq(0.1, 0.5, by=0.01)
n   <- 3 + (qnorm(0.95) / z(rho)) ^ 2
ggplot(mapping=aes(x=rho, y=n)) + geom_line() +
  scale_x_continuous(minor_breaks=seq(0.1, 0.5, by=0.01)) +
  scale_y_continuous(minor_breaks=seq(10, 270, by=10)) +
  xlab(expression(rho))
Figure 8.6: Sample size required to ensure a high probability of the sample correlation coefficient being in the right direction

To have at least a 0.95 chance of being in the right direction when the true correlation is 0.1 (or -0.1) requires \(n > 250\).
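
As a quick check, the sample-size formula can be evaluated directly for \(\rho = 0.1\), reusing the z() function defined in the code above:

Code
# Check: approximate n for a 0.95 chance of the correct sign when rho = 0.1
ceiling(3 + (qnorm(0.95) / z(0.1)) ^ 2)   # approximately 272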

For a given \(n\) we can compute the probability of being in the right direction for a few values of \(\rho\) as shown below.

Code
d <- expand.grid(n=10:400, rho=c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5))
d <- transform(d, p=pnorm(sqrt(n - 3) * z(rho)))
ggplot(d, aes(x=n, y=p, color=factor(rho))) + geom_line() +
  scale_x_continuous(minor_breaks=seq(10, 390, by=10)) +
  scale_y_continuous(minor_breaks=seq(0.5, 1, by=0.01)) +
  ylab('Pr(correct sign on r)') +
  guides(color=guide_legend(title=expression(rho)))
Figure 8.7: Probability of having the correct sign on \(r\) as a function of \(n\) and true correlation \(\rho\)

8.5.3 Comparing Two \(r\)’s

  • Rarely appropriate
  • Two \(r\)’s can be the same even though slopes may differ
  • Usually better to compare effects on a real scale (slopes)

8.6 Avoiding Overinterpretation

  • Often researchers
    • compute many correlations then
    • make a big deal out of the largest observed correlation
  • This is double dipping: using the same dataset to tell you which features to test, then testing those features
  • This is a ranking and selection problem, and the data seldom contain enough information to be reliable in the choices

8.6.1 Simulate Data

  • For our analysis experiments, simulate a sample of size 50 from a 10-variate normal distribution with a known correlation matrix
  • To specify this correlation matrix take the easy way out: compute an observed correlation matrix from a small sample where all correlations in the population are zero
  • The usual sample noise will generate some large observed correlations
Code
require(Hmisc)
require(mvtnorm)
# Get a population correlation matrix by getting sample correlations
# on a random normal sample with N=20 and all true correlations=0
# then pretend these sample correlations were real population values
set.seed(3)
x <- rmvnorm(20, sigma=diag(10))
R <- rcorr(x)$r
# True correlations we will simulate from:
round(R, 2)
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,]  1.00  0.01  0.47 -0.09 -0.31 -0.07  0.14  0.05  0.02 -0.49
 [2,]  0.01  1.00  0.48  0.27  0.27  0.14  0.40 -0.17 -0.59  0.60
 [3,]  0.47  0.48  1.00 -0.11  0.26 -0.31  0.45 -0.12 -0.50 -0.06
 [4,] -0.09  0.27 -0.11  1.00  0.42  0.42 -0.07 -0.35  0.16  0.35
 [5,] -0.31  0.27  0.26  0.42  1.00 -0.04  0.03 -0.40 -0.25  0.34
 [6,] -0.07  0.14 -0.31  0.42 -0.04  1.00  0.19 -0.36  0.08  0.40
 [7,]  0.14  0.40  0.45 -0.07  0.03  0.19  1.00 -0.37 -0.65 -0.07
 [8,]  0.05 -0.17 -0.12 -0.35 -0.40 -0.36 -0.37  1.00  0.17 -0.24
 [9,]  0.02 -0.59 -0.50  0.16 -0.25  0.08 -0.65  0.17  1.00 -0.18
[10,] -0.49  0.60 -0.06  0.35  0.34  0.40 -0.07 -0.24 -0.18  1.00
Code
# Get a huge sample from a multivariate normal distribution to see
# that it mimics the real correlation matrix R
x <- rmvnorm(50000, sigma=R)
table(round(R - rcorr(x)$r, 2))

-0.01     0  0.01 
   14    76    10 
Code
# Now sample from the population to get our dataset with N=50
x <- rmvnorm(50, sigma=R)
rorig <- rcorr(x)$r
round(rorig, 2)
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,]  1.00 -0.01  0.50  0.11  0.00 -0.16  0.07 -0.04 -0.04 -0.43
 [2,] -0.01  1.00  0.50  0.19  0.43  0.13  0.60 -0.26 -0.76  0.69
 [3,]  0.50  0.50  1.00 -0.15  0.45 -0.41  0.52 -0.18 -0.58 -0.01
 [4,]  0.11  0.19 -0.15  1.00  0.45  0.30 -0.13 -0.35  0.09  0.14
 [5,]  0.00  0.43  0.45  0.45  1.00 -0.12  0.27 -0.53 -0.42  0.20
 [6,] -0.16  0.13 -0.41  0.30 -0.12  1.00  0.06 -0.27  0.00  0.51
 [7,]  0.07  0.60  0.52 -0.13  0.27  0.06  1.00 -0.57 -0.81  0.19
 [8,] -0.04 -0.26 -0.18 -0.35 -0.53 -0.27 -0.57  1.00  0.37 -0.13
 [9,] -0.04 -0.76 -0.58  0.09 -0.42  0.00 -0.81  0.37  1.00 -0.41
[10,] -0.43  0.69 -0.01  0.14  0.20  0.51  0.19 -0.13 -0.41  1.00

8.6.2 Margin of Error for a Single \(r\)

  • First compute the margin of error in estimating a single \(r\) from \(n=50\)
  • This is the spacing between \(r\) and its lower 0.95 confidence limit, or between \(r\) and its upper limit, whichever is greater
  • CL based on Fisher’s \(z\)-transformation described earlier
  • Compute this for 4 hypothetical true \(r\): 0, 0.25, 0.5, 0.75
Code
r <- (0 : 3) / 4
n <- 50
zcrit <- qnorm(0.975)
z <- 0.5 * log( (1 + r) / (1 - r))
lo <- z - zcrit/sqrt(n-3)
hi <- z + zcrit/sqrt(n-3)
rlo <- (exp(2*lo)-1)/(exp(2*lo)+1)
rhi <- (exp(2*hi)-1)/(exp(2*hi)+1)
w <- rbind(r=r, 'Margin of Error'=pmax(rhi - r, r - rlo))
prmatrix(round(w, 2), collab=rep('', 4))
                                   
r               0.00 0.25 0.50 0.75
Margin of Error 0.28 0.28 0.24 0.15
  • If the true correlation is 0.5, the margin of error in estimating it is \(\pm 0.24\) with \(n=50\)

8.6.3 Bootstrapping the Correlation Selection Process

  • Can use the bootstrap to document the difficulty of the task
  • Steps:
  1. form a matrix with \(N\) rows and \(p\) columns where \(N\) is the number of observations and \(p\) is the number of variables being correlated with each other
  2. draw a sample of size \(N\) from this matrix by sampling its rows with replacement
  3. compute all the correlation coefficients as was done previously
  4. re-run the same selection process as was done on the original dataset
  5. repeat 1000 times
  6. examine the distribution of the selections over the 1000 repeats
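
A generic sketch of these steps is shown below; the function name boot_select and the default select argument are hypothetical, and the concrete analyses in the next two subsections specialize the selection rule.

Code
# Generic sketch of the bootstrap selection loop
# x: N x p data matrix; select() is whatever selection rule was applied to
# the original data (default shown: index of the largest off-diagonal r)
boot_select <- function(x, B=1000,
                        select=function(r) which.max(r[upper.tri(r)])) {
  N <- nrow(x)
  replicate(B, {
    xb <- x[sample(N, replace=TRUE), ]   # step 2: resample rows with replacement
    select(cor(xb))                      # steps 3-4: recompute r's, re-run selection
  })
}
# Step 6: examine the distribution of selections, e.g. table(boot_select(x))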

8.6.4 Bootstrapping Bias in Largest Observed \(r\)

  • Start with something simpler than ranking all the \(r\)s in the matrix: estimate the bias in the largest observed \(r\)
  • Draw a sample with replacement from rows of the data matrix
  • For each of these bootstrap samples find which pair of variables has the highest \(r\)
  • Track this variable pair and compute \(r\) in the original sample
  • Compute the drop-off in \(r\) from the bootstrap sample to the original sample
  • This simulates the process used to find the largest \(r\)
  • Example: use data simulated above with 10 standard normal random variables & known correlation matrix on 50 subjects
  • Sample correlation matrix has \(\frac{10 \times 9}{2} = 45\) distinct coefficients
Code
# Function to retrieve the upper triangle of a symmetric matrix
# ignoring the diagonal terms
up <- function(z) z[upper.tri(z)]
rorigu <- up(rorig)
max(rorigu)   # .685 = [2,10] element; 0.604 in population
[1] 0.6854261
Code
which.max(rorigu)
[1] 38
Code
# is the 38th element in the upper triangle

# Tabulate the difference between sample r estimates and true values
Ru <- up(R)
mean(abs(Ru - rorigu))
[1] 0.1149512
Code
table(round(Ru - rorigu, 1))

-0.3 -0.2 -0.1    0  0.1  0.2 
   2    6    7    7   17    6 
Code
# Repeat the "finding max r" procedure for 1000 bootstrap samples
# Sample from x 1000 times with replacement, each time computing
# a new correlation matrix
samepair <- dropsum <- 0
for(i in 1 : 1000) {
  b  <- sample(1 : 50, replace=TRUE)
  xb <- x[b, ]   # sample with replacement from rows
  r  <- rcorr(xb)$r
  ru <- up(r)
  wmax <- which.max(ru)
  if(wmax == 38) samepair <- samepair + 1
  # Compute correlation for the bootstrap best pair in the original sample
  origr <- rorigu[wmax]
  # Compute drop-off in best r
  dropoff <- ru[wmax] - origr
  dropsum <- dropsum + dropoff
}
cat('Number of bootstraps selecting the original most correlated pair:',
    samepair, 'out of 1000', '\n')
Number of bootstraps selecting the original most correlated pair: 642 out of 1000 
Code
cat('Mean dropoff for max r:', round(dropsum / 1000, 3), '\n')
Mean dropoff for max r: 0.071 
  • For our dataset with \(n=50\) we expect that the maximum observed \(r\) out of 45 \(r\)s is biased high by 0.071
    • Could subtract 0.071 to debias the observed max \(r\) although this will worsen its precision

8.6.5 Bootstrapping Ranks of All \(r\)s

  • Do a more comprehensive assessment that quantifies the difficulty of the task in ranking all the \(r\)s
  • Quantify uncertainties in the ranks of the original correlation coefficients
  • The apparent “winner” was the pair receiving the highest ranking \(q\) among all \(r\)s
  • What is the distribution of \(q\) over the 1000 bootstrap samples?
  • Can easily compute a 0.95 bootstrap nonparametric percentile confidence interval for the true unknown ranking of that feature combination and for ranking all the examined feature correlations
  • Bootstrap the correlation matrix and re-rank the coefficients
  • For each pair of variables compute the 0.95 confidence interval for the rank of its correlation from among the 45
Code
# For each observed correlation compute its rank among 45 distinct pairs
orig.ranks <- rank(up(rorig))
# Sample from x 1000 times with replacement, each time computing
# a new correlation matrix
Rm <- matrix(NA, nrow=1000, ncol=45)
for(i in 1 : 1000) {
  b  <- sample(1 : 50, replace=TRUE)
  xb <- x[b, ]
  r  <- rcorr(xb)$r
  Rm[i, ] <- rank(up(r))
}
# Over bootstrap correlations compute quantiles of ranks
low  <- apply(Rm, 2, quantile, probs=0.025)
high <- apply(Rm, 2, quantile, probs=0.975)
round(cbind('Original Rank'=orig.ranks, Lower=low, Upper=high))
      Original Rank Lower Upper
 [1,]            22    12    34
 [2,]            40    32    45
 [3,]            41    30    45
 [4,]            28    15    37
 [5,]            31    20    38
 [6,]            15     8    28
 [7,]            24    11    34
 [8,]            37    32    43
 [9,]            38    32    43
[10,]            39    31    44
[11,]            14     8    27
[12,]            29    15    37
[13,]             9     3    15
[14,]            35    22    43
[15,]            18    10    26
[16,]            26    16    34
[17,]            44    38    45
[18,]            43    34    45
[19,]            16     9    27
[20,]            34    25    38
[21,]            25    14    34
[22,]            19    11    32
[23,]            12     7    21
[24,]            13     7    27
[25,]            10     4    20
[26,]             5     3    10
[27,]            11     5    22
[28,]             4     3     8
[29,]            20    10    33
[30,]             2     1     4
[31,]             3     2    11
[32,]            27    16    36
[33,]             7     4    13
[34,]            23    13    32
[35,]             1     1     2
[36,]            36    28    43
[37,]             6     3    15
[38,]            45    40    45
[39,]            21    12    31
[40,]            30    17    37
[41,]            33    21    39
[42,]            42    34    44
[43,]            32    20    38
[44,]            17    11    28
[45,]             8     3    16
  • Highest observed \(r\) (rank 45) has 0.95 CI \([40,45]\)
  • Data are consistent with it being the \(6^\textrm{th}\) highest
  • Smallest observed value (rank 1; most impressive negative correlation) has CI \([1, 2]\) for rank; data consistent with that pair of variables being the best or \(2^\textrm{nd}\)
  • The \(r\) originally ranked \(36^\textrm{th}\) has a 0.95 CI of \([28, 43]\) so the data are consistent with it being in the top 3

8.6.6 Monte Carlo Simulation to Get A Rank Interval

  • For comparison with the bootstrap, get the frequency distribution of ranks in repeated studies of the apparently highest \(r\) in our \(n=50\) study
  • Repeated studies will also have \(n=50\) and will be generated from the population correlation matrix
  • Recall that original max \(r\) was the \(38^\textrm{th}\) element of the strung-out \(r\) matrix from our sample
Code
ranks <- integer(1000)
for(i in 1 : 1000) {
  xs <- rmvnorm(50, sigma=R)
  rsim <- up(rcorr(xs)$r)
  ranks[i] <- rank(rsim)[38]
}
table(ranks)   # freqs. of ranks of 38th element in new samples
ranks
 34  35  36  37  38  39  40  41  42  43  44  45 
  2   1   6   8  10  18  25  38  58  88 152 594 
Code
quantile(ranks, c(0.025, 0.975))
 2.5% 97.5% 
   38    45 
  • This interval is a bit wider than the bootstrap interval
  • Note: both the bootstrap and the Monte Carlo simulated interval would be far wider had the number of correlation coefficients estimated been much greater than the sample size
  • Example: consider 1000 possible associations with a single \(Y\)
  • This time the potential predictors \(X\) will be independent in the population and they will be conditioned on (held constant over simulations at the original \(X\) values)
  • The true importance of the \(X\)s is \(1, 2, \ldots, 10\) for the first 10 and all remaining are irrelevant in the population
Code
set.seed(8)
n <- 50
p <- 1000
x <- matrix(rnorm(n * p), ncol=p)
Ey <- x[,1] + 2 * x[,2] + 3 * x[,3] + 4 * x[,4] + 5 * x[,5] + 6 * x[,6] +
      7 * x[,7] + 8 * x[,8] + 9 * x[,9] + 10 * x[,10]
y <- Ey + rnorm(n) * 20
ro <- cor(x, y)
# First 10 correlations and tabulate all others
round(ro[1:10], 2)
 [1]  0.26 -0.16  0.05  0.29 -0.03  0.04  0.28  0.05  0.28  0.57
Code
table(round(ro[-(1:10)], 1))

-0.5 -0.4 -0.3 -0.2 -0.1    0  0.1  0.2  0.3  0.4  0.5 
   1    6   43   97  234  250  210  120   25    3    1 
Code
# Find which of the 1000 correlation against Y is largest
wmax <- which.max(ro)
wmax     # correct according to population
[1] 10
Code
ro[wmax]   # original rank=1000
[1] 0.5698995
Code
# Simulate 1000 repeats of sample with new y but keeping x the same
# See how original highest correlating variable ranks among 1000
# correlations in new samples

ranks <- numeric(1000)
for(i in 1 : 1000) {
  ys <- Ey + rnorm(n) * 20
  rs <- cor(x, ys)
  ranks[i] <- rank(rs)[wmax]
}

table(round(ranks, -1))   # round to nearest 10

 350  400  580  600  610  620  630  660  670  680  690  700  710  720  730  740 
   1    1    1    1    1    2    1    1    4    2    2    2    1    2    1    1 
 760  770  780  790  800  810  820  830  840  850  860  870  880  890  900  910 
   1    1    2    4    6    2    6   11    3    6    6    9   13    8   15   15 
 920  930  940  950  960  970  980  990 1000 
  15   34   39   41   50   66  126  203  294 
Code
quantile(ranks, c(0.025, 0.975))
   2.5%   97.5% 
 770.85 1000.00 
Code
sum(ranks > 998)
[1] 139
  • The apparent winning variable could fairly easily be the \(771^\textrm{st}\) largest instead of the \(1000^\textrm{th}\) ranked correlation
  • The winning variable was in the top two in only 139 out of 1000 simulations
  • See Chapter 20 for more ways to quantify limitations of high-dimensional data analysis, including the \(p > N\) case

8.7 Nonparametric Regression

RMS 2.4.7 ABD 17.8 🅓

  • Linearity of relationships is often assumed; not a realistic assumption
  • Nonparametric regression is an excellent descriptive tool that does not assume linearity; only assumes smoothness
  • Estimates tendency of \(Y\) (e.g., mean, median) as a function of a continuous \(X\)
  • Estimates shape and steepness of the relationship
  • Few assumptions
  • Especially handy when there is a single \(X\)
  • Plotted trend line may be the final result of the analysis
  • Simplest smoother: moving average
  • Example with unequally-spaced \(X\)
  • Typical approach linking mean of \(X\) in an interval with mean of \(Y\) for that \(X\)-interval
\(X\): 1 2 3 5 8
\(Y\): 2.1 3.8 5.7 11.1 17.2

\[\begin{array}{ccc} \hat{E}(Y | X=2) &=& \frac{2.1+3.8+5.7}{3} \\ \hat{E}(Y | X=\frac{2+3+5}{3}) &=& \frac{3.8+5.7+11.1}{3} \end{array}\]
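
A minimal sketch of this moving-average computation for the five points above:

Code
# Sketch: 3-observation moving windows for the small example above
x <- c(1, 2, 3, 5, 8)
y <- c(2.1, 3.8, 5.7, 11.1, 17.2)
k <- 3                                   # observations per window
i <- 1 : (length(x) - k + 1)
cbind(mean.x = sapply(i, function(j) mean(x[j : (j + k - 1)])),
      mean.y = sapply(i, function(j) mean(y[j : (j + k - 1)])))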

  • overlap OK
  • problem in estimating \(E(Y)\) at outer \(X\)-values
  • estimates very sensitive to bin width
  • Moving linear regression superior to moving avg. (moving flat line)
  • Cleveland’s moving linear regression smoother loess (locally weighted least squares) is the most popular smoother
  • Figure 8.4 introduced us to loess
  • Another example: differential diagnosis of acute bacterial meningitis vs. viral meningitis
Code
require(Hmisc)
require(data.table)
getHdata(abm)   # Loads data frame ABM (note case)
# Convert to a data table and add derived variables
setDT(ABM)
ABM[, glratio := gl / bloodgl]
ABM[, tpolys  := polys * whites / 100]
r <- ABM[, {
  plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
       ylab='CSF/Blood Glucose Ratio',
       xlim=quantile(tpolys,  c(.05,.95), na.rm=TRUE),
       ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
 scat1d(tpolys); scat1d(glratio, side=4) } ]
Figure 8.8: loess nonparametric smoother relating CSF:blood glucose ratio to total CSF polymorph count in patients with either bacterial or viral meningitis. Rug plots on the axes indicate raw data values.

The following example uses the “super smoother” of Friedman (1984) in place of loess to estimate a complex relationship that has a sharp turn.

Code
r <- ABM[, {
  plsmo(age, abm, 'supsmu', bass=7,
        xlab='Age at Admission, Years',
        ylab='Proportion Bacterial Meningitis')
  scat1d(age) } ]
Figure 8.9: Super smoother relating age to the probability of bacterial meningitis given a patient has bacterial or viral meningitis, with a rug plot showing the age distribution.
  • In both examples there is strong non-linearity, even non-monotonicity
  • But loess and the super smoother are limited to estimating a mean or proportion
  • Idea of moving average can be extended to all types of \(Y\), even censored survival times
  • Consider overlapping windows in which we compute moving statistics in general
  • See here for details and several examples
  • Brief idea:
    • create overlapping moving windows of \(X\)
    • for each window compute the statistic on \(Y\)
    • link it with the mean of \(X\) for that window
    • advance \(X\) by a small amount and repeat
  • Smaller windows \(\rightarrow\) noisier estimates of \(Y\) but avoid bias
  • Smooth the moving estimates using loess or super smoother when finished
  • To handle skewed \(X\) it’s best to form windows on the basis of sample size rather than on absolute \(X\)
  • General R function movStats in the Hmisc package
  • Default algorithm:
    • sort the data by \(X\)
    • get a target grid of \(X\) values for which to estimate \(Y\) separately
    • take 15 observations to the left of the target \(X\) and 15 to the right as the window
    • for each new calculation move up \(\frac{n}{200}\) observations (1 observation if \(n < 200\)) to form the new window
  • Example using the glucose data but estimating 3 quartiles plus the mean
Code
u <- movStats(glratio ~ tpolys, data=ABM, pr='margin')
Window Sample Sizes
N Mean Min Max xinc
283 30.8 25 31 1
Code
ggplot(u, aes(x=tpolys, y=`Moving Median`)) + geom_line() +
    geom_ribbon(aes(ymin=`Moving Q1`, ymax=`Moving Q3`), alpha=0.2) +
  geom_line(aes(x=tpolys, y=`Moving Mean`, col=I('blue'))) + 
    xlab('Total Polymorphs in CSF') +
    ylab('CSF/Blood Glucose Ratio') + ylim(0, 1)
Figure 8.10: Moving quartiles and mean (blue line) to relate CSF:blood glucose ratio to total CSF polymorph count in patients with either bacterial or viral meningitis. Black line is the moving median. The band depicts outer quartiles.
  • Age example
Code
require(qreport)   # provides addggLayers
u <- movStats(abm ~ age, msmooth='both', data=ABM, melt=TRUE, pr='margin')
Window Sample Sizes
N Mean Min Max xinc
420 30.9 25 31 2
Code
g <- ggplot(u, aes(x=age, y=abm)) + geom_line(aes(col=Type)) +
  xlab('Age at Admission, years') +
  ylab('Proportion Bacterial Meningitis')
g <- addggLayers(g, ABM, value='age', type='spike')
addggLayers(g, ABM, value='age', type='ebp', pos='top')
Figure 8.11: Moving proportions relating age to the probability of bacterial meningitis given a patient has bacterial or viral meningitis, with a spike histogram and an extended box plot showing the age distribution. Moving proportions before smoothing are also shown. To the right of the extended box plot is a dot for the 0.99 quantile of age
  • We also need to estimate the relationship between continuous variables and time-to-event when follow-up is incomplete
  • Example: VA lung cancer dataset
  • For each moving window compute the Kaplan-Meier 6m and 12m survival estimates
  • Subtract cumulative survival from 1.0 to get cumulative incidence
Code
getHdata(valung)
require(survival)    # for Surv()
u <- movStats(Surv(t / (365.25 / 12), dead) ~ age, times=c(6, 12),
              eps=25, data=valung, melt=TRUE, pr='margin', tunits='month')
Window Sample Sizes
N Mean Min Max xinc
137 48.7 35 51 1
Code
g <- ggplot(u, aes(x=age, y=incidence)) + geom_line(aes(col=Statistic)) +
      guides(color=guide_legend(title='')) + ylab('Mortality')
addggLayers(g, valung, value='age', type='spike')
Figure 8.12: Moving 6m and 12m mortality as a function of age using VA lung cancer dataset