## Code

```
<- lrm(y ~ sex + race + rcs(age,5) + rcs(weight,5) +
f rcs(height,5) + rcs(blood.pressure,5))
plot(anova(f))
```

- 4.1 Prespecification of Predictor Complexity Without Later Simplification
- 4.2 Checking Assumptions of Multiple Predictors Simultaneously
- 4.3 Variable Selection
- 4.4 Overfitting and Limits on Number of Predictors
- 4.5 Shrinkage
- 4.6 Collinearity
- 4.7 Data Reduction
- 4.8 Other Approaches to Predictive Modeling
- 4.9 Overly Influential Observations
- 4.10 Comparing Two Models
- 4.11 Improving the Practice of Multivariable Prediction
- 4.12 Summary: Possible Modeling Strategies
- 4.13 Study Questions

- “Spending d.f.”: examining or fitting parameters in models, or examining tables or graphs that utilize \(Y\) to tell you how to model variables
- If wish to preserve statistical properties, can’t retrieve d.f. once they are “spent” (see Grambsch & O’Brien)
- If a scatterplot suggests linearity and you fit a linear model, how many d.f. did you actually spend (i.e., the d.f. that when put into a formula results in accurate confidence limits or \(P\)-values)?
- Decide number of d.f. that can be spent
- Decide where to spend them
- Spend them
- General references: (Giudice et al., 2011; Frank E. Harrell et al., 1996; Nick & Hardin, 1999; Steyerberg et al., 2001)

AB

There are many choices to be made when deciding upon a global modeling strategy, including choice between

- parametric and nonparametric procedures
- parsimony and complexity
- parsimony and good discrimination ability
- interpretable models and black boxes.

C

- Rarely expect linearity
- Can’t always use graphs or other devices to choose transformation
- If select from among many transformations, results biased
- Need to allow flexible nonlinearity to potentially strong predictors not
*known*to predict linearly - Once decide a predictor is “in” can choose no. of parameters to devote to it using a general association index with \(Y\)
- Need a measure of “potential predictive punch”
- Measure needs to mask analyst to true form of regression to preserve statistical properties

D

Motivating examples:

```
# Overfitting a flat relationship
require(rms)
set.seed(1)
x <- runif(1000)
y <- runif(1000, -0.5, 0.5)
dd <- datadist(x, y); options(datadist='dd')
par(mfrow=c(2,2), mar=c(2, 2, 3, 0.5))
pp <- function(actual) {
yhat <- predict(f, data.frame(x=xs))
yreal <- actual(xs)
plot(0, 0, xlim=c(0,1),
ylim=range(c(quantile(y, c(0.1, 0.9)), yhat,
yreal)),
type='n', axes=FALSE)
axis(1, labels=FALSE); axis(2, labels=FALSE)
lines(xs, yreal)
lines(xs, yhat, col='blue')
}
f <- ols(y ~ rcs(x, 5))
xs <- seq(0, 1, length=150)
pp(function(x) 0*x)
title('Mild Error:\nOverfitting a Flat Relationship',
cex=0.5)
y <- x + runif(1000, -0.5, 0.5)
f <- ols(y ~ rcs(x, 5))
pp(function(x) x)
title('Mild Error:\nOverfitting a Linear Relationship',
cex=0.5)
y <- x^4 + runif(1000, -1, 1)
f <- ols(y ~ x)
pp(function(x) x^4)
title('Serious Error:\nUnderfitting a Steep Relationship',
cex=0.5)
y <- - (x - 0.5) ^ 2 + runif(1000, -0.2, 0.2)
f <- ols(y ~ x)
pp(function(x) - (x - 0.5) ^ 2)
title('Tragic Error:\nMonotonic Fit to\nNon-Monotonic Relationship',
cex=0.5)
```

Categorical predictor with \(k\) levels | Collapse less frequent categories into “other” |

Continuous predictor represented as \(k\)-knot restricted cubic spline | Reduce \(k\) to a number as low as 3, or 0 (linear) |

E

When the effective sample size available is sufficiently large so that a saturated main effects model may be fitted, a good approach to gauging predictive potential is the following.

- Let all continuous predictors be represented as restricted cubic splines with \(k\) knots, where \(k\) is the maximum number of knots the analyst entertains for the current problem.
- Let all categorical predictors retain their original categories except for pooling of very low prevalence categories (e.g., ones containing \(<6\) observations).
- Fit this general main effects model.
- Compute the partial \(\chi^2\) statistic for testing the association of each predictor with the response, adjusted for all other predictors. In the case of ordinary regression convert partial \(F\) statistics to \(\chi^2\) statistics or partial \(R^2\) values.
- Make corrections for chance associations to “level the playing field” for predictors having greatly varying d.f., e.g., subtract the d.f. from the partial \(\chi^2\) (the expected value of \(\chi^{2}_{p}\) is \(p\) under \(H_{0}\)).
- Make certain that tests of nonlinearity are not revealed as this would bias the analyst.
- Sort the partial association statistics in descending order.

F

Commands in the `rms`

package can be used to plot only what is needed. Here is an example for a logistic model.

When collinearities or confounding are not problematic, a quicker approach based on pairwise measures of association can be useful. This approach will not have numerical problems (e.g., singular covariance matrix) and is based on:

- 2 d.f. generalization of Spearman \(\rho\)—\(R^2\) based on \(rank(X)\) and \(rank(X)^2\) vs. \(rank(Y)\)
- \(\rho^2\) can detect U-shaped relationships
- For categorical \(X\), \(\rho^2\) is \(R^2\) from dummy variables regressed against \(rank(Y)\); this is tightly related to the Wilcoxon-Mann-Whitney-Kruskal-Wallis rank test for group differences
^{1} - Sort variables by descending order of \(\rho^2\)
- Specify number of knots for continuous \(X\), combine infrequent categories of categorical \(X\) based on \(\rho^2\)

G

^{1} This test statistic does not inform the analyst of *which* groups are different from one another.

Allocating d.f. based on partial tests of association or sorting \(\rho^2\) is a fair procedure because

- We already decided to keep variable in model no matter what \(\rho^2\) or \(\chi^2\) values are seen
- \(\rho^2\) and \(\chi^2\) do not reveal degree of nonlinearity; high value may be due solely to strong linear effect
- low \(\rho^2\) or \(\chi^2\) for a categorical variable might lead to collapsing the most disparate categories

H

Initial simulations show the procedure to be conservative. Note that one can move from simpler to more complex models but not the other way round

- Sometimes failure to adjust for other variables gives wrong transformation of an \(X\), or wrong significance of interactions
- Sometimes unwieldy to deal simultaneously with all predictors at each stage \(\rightarrow\) assess regression assumptions separately for each predictor

I

- Series of potential predictors with no prior knowledge
- \(\uparrow\) exploration \(\rightarrow\) \(\uparrow\) shrinkage (overfitting)
- Summary of problem: \(E(\hat{\beta} | \hat{\beta}\) “significant” \() \neq \beta\) (Chatfield, 1995)
- Biased \(R^2\), \(\hat{\beta}\), standard errors, \(P\)-values too small
- \(F\) and \(\chi^2\) statistics do not have the claimed distribution
^{2}(Grambsch & O’Brien, 1991) - Will result in residual confounding if use variable selection to find confounders (Greenland, 2000)
- Derksen & Keselman (1992) found that in stepwise analyses the final model represented noise 0.20-0.74 of time, final model usually contained \(< \frac{1}{2}\) actual number of authentic predictors. Also:

J

^{2} Lockhart et al. (2013) provide an example with \(n=100\) and 10 orthogonal predictors where all true \(\beta\)s are zero. The test statistic for the first variable to enter has type I assertion probability of 0.39 when the nominal \(\alpha\) is set to 0.05.

- The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model.
- The number of candidate predictor variables affected the number of noise variables that gained entry to the model.
- The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
- The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model.

- Global test with \(p\) d.f. insignificant \(\rightarrow\)
**stop**

Simulation experiment, true \(\sigma^{2} = 6.25\), 8 candidate variables, 4 of them related to \(Y\) in the population. Select best model using all possible subsets regression to maximize \(R^{2}_{adj}\) (all possible subsets is not usually recommended but gives variable selection more of a chance to work in this context).

L

**Note**: The audio was made using stepAIC with collinearities in predictors. The code below allows for several options. Here we use all possible subsets of predictors and force predictors to be uncorrelated, which is the easiest case for variable selection.

```
require(MASS) # provides stepAIC function
require(leaps) # provides regsubsets function
sim <- function(n, sigma=2.5, method=c('stepaic', 'leaps'),
pr=FALSE, prcor=FALSE, dataonly=FALSE) {
method <- match.arg(method)
if(uncorrelated) {
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)
x6 <- rnorm(n)
x7 <- rnorm(n)
x8 <- rnorm(n)
}
else {
x1 <- rnorm(n)
x2 <- x1 + 2.0 * rnorm(n) # was + 0.5 * rnorm(n)
x3 <- rnorm(n)
x4 <- x3 + 1.5 * rnorm(n)
x5 <- x1 + rnorm(n)/1.3
x6 <- x2 + 2.25 * rnorm(n) # was rnorm(n)/1.3
x7 <- x3 + x4 + 2.5 * rnorm(n) # was + rnorm(n)
x8 <- x7 + 4.0 * rnorm(n) # was + 0.5 * rnorm(n)
}
z <- cbind(x1,x2,x3,x4,x5,x6,x7,x8)
if(prcor) return(round(cor(z), 2))
lp <- x1 + x2 + .5*x3 + .4*x7
y <- lp + sigma*rnorm(n)
if(dataonly) return(list(x=z, y=y))
if(method == 'leaps') {
s <- summary(regsubsets(z, y))
best <- which.max(s$adjr2)
xvars <- s$which[best, -1] # remove intercept
ssr <- s$rss[best]
p <- sum(xvars)
xs <- if(p == 0) 'none' else paste((1 : 8)[xvars], collapse='')
if(pr) print(xs)
ssesw <- (n - 1) * var(y) - ssr
s2s <- ssesw / (n - p - 1)
yhat <- if(p == 0) mean(y) else fitted(lm(y ~ z[, xvars]))
}
f <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
if(method == 'stepaic') {
g <- stepAIC(f, trace=0)
p <- g$rank - 1
xs <- if(p == 0) 'none' else
gsub('[ <br>+x]','', as.character(formula(g))[3])
if(pr) print(formula(g), showEnv=FALSE)
ssesw <- sum(resid(g)^2)
s2s <- ssesw/g$df.residual
yhat <- fitted(g)
}
# Set SSEsw / (n - gdf - 1) = true sigma^2
gdf <- n - 1 - ssesw / (sigma^2)
# Compute root mean squared error against true linear predictor
rmse.full <- sqrt(mean((fitted(f) - lp) ^ 2))
rmse.step <- sqrt(mean((yhat - lp) ^ 2))
list(stats=c(n=n, vratio=s2s/(sigma^2),
gdf=gdf, apparentdf=p, rmse.full=rmse.full, rmse.step=rmse.step),
xselected=xs)
}
rsim <- function(B, n, method=c('stepaic', 'leaps')) {
method <- match.arg(method)
xs <- character(B)
r <- matrix(NA, nrow=B, ncol=6)
for(i in 1:B) {
w <- sim(n, method=method)
r[i,] <- w$stats
xs[i] <- w$xselected
}
colnames(r) <- names(w$stats)
s <- apply(r, 2, median)
p <- r[, 'apparentdf']
s['apparentdf'] <- mean(p)
print(round(s, 2))
print(table(p))
cat('Prob[correct model]=', round(sum(xs == '1237')/B, 2), '\n')
}
```

Show the correlation matrix being assumed for the \(X\)s:

```
x1 x2 x3 x4 x5 x6 x7 x8
x1 1.00 0.00 -0.01 0.00 0.01 0.01 0.00 0.01
x2 0.00 1.00 0.00 0.00 0.01 0.00 0.00 0.00
x3 -0.01 0.00 1.00 0.00 0.00 0.00 0.00 -0.01
x4 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.01
x5 0.01 0.01 0.00 0.01 1.00 0.01 0.00 0.00
x6 0.01 0.00 0.00 0.00 0.01 1.00 0.00 0.00
x7 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
x8 0.01 0.00 -0.01 0.01 0.00 0.00 0.01 1.00
```

Simulate to find the distribution of the number of variables selected, the proportion of simulations in which the true model (\(X_{1}, X_{2}, X_{3}, X_{7}\)) was found, the median value of \(\hat{\sigma}^{2}/\sigma^{2}\), the median effective d.f., and the mean number of apparent d.f., for varying sample sizes.

M

```
n vratio gdf apparentdf rmse.full rmse.step
20.00 0.94 5.32 4.10 1.62 1.58
p
1 2 3 4 5 6 7 8
3 14 18 22 27 11 4 1
Prob[correct model]= 0.02
```

```
n vratio gdf apparentdf rmse.full rmse.step
40.00 0.61 17.89 4.38 1.21 1.24
p
2 3 4 5 6 7
9 18 24 29 15 5
Prob[correct model]= 0.04
```

```
n vratio gdf apparentdf rmse.full rmse.step
150.00 0.44 85.99 5.01 0.59 0.57
p
2 3 4 5 6 7 8
1 5 27 35 24 7 1
Prob[correct model]= 0.2
```

```
n vratio gdf apparentdf rmse.full rmse.step
300.00 0.42 177.01 5.16 0.43 0.40
p
4 5 6 7 8
27 42 20 10 1
Prob[correct model]= 0.26
```

```
n vratio gdf apparentdf rmse.full rmse.step
2000.00 1.00 6.43 4.58 0.17 0.15
p
4 5 6 7
53 37 9 1
Prob[correct model]= 0.53
```

As \(n\uparrow\) the mean number of variables selected increased. The proportion of simulations in which the correct model was found increased from 0 to 0.53. \(\sigma^{2}\) is underestimated when \(n=300\) by a factor of 0.42, resulting in the d.f. needed to de-bias \(\hat{\sigma^{2}}\) being greater than \(n\) when the apparent d.f. was only 5.16 on the average. Variable selection slightly increased closeness to the true \(X\beta\).

If the simulations are re-run allowing for collinearities (`uncorrelated=FALSE`

) one can expect variable selection to be even more problematic.

Variable selection methods (F. E. Harrell, 1986):

- Forward selection, backward elimination
- Stopping rule: “residual \(\chi^{2}\)” with d.f. = no. candidates remaining at current step
- Test for significance or use Akaike’s information criterion (AIC (Atkinson, 1980)), here \(\chi^{2}-2 \times d.f.\)
- Better to use subject matter knowledge!
- No currently available stopping rule was developed for stepwise, only for comparing a limited number of pre-specified models(Leo Breiman, 1992, sec. 1.3)
- Roecker (1991) studied forward selection (FS), all possible subsets selection (APS), full fits
- APS more likely to select smaller, less accurate models than FS
- Neither as accurate as full model fit unless more than half of candidate variables redundant or unnecessary
- Step-down is usually better than forward (Mantel, 1970) and can be used efficiently with maximum likelihood estimation (Lawless & Singhal, 1978)
- Fruitless to try different stepwise methods to look for agreement (Wiegand, 2010)
- Bootstrap can help decide between full and reduced model
- Full model fits gives meaningful confidence intervals with standard formulas, C.I. after stepwise does not (Altman & Andersen, 1989; Leo Breiman, 1992; Hurvich & Tsai, 1990)
- Data reduction (grouping variables) can help
- Using the bootstrap to select important variables for inclusion in the final model (Sauerbrei & Schumacher, 1992) is problematic (Austin, 2008)
- It is not logical that a population regression coefficient would be exactly zero just because its estimate was “insignificant”

NOP

See also these articles:

- Step away from stepwise by Gary Smith
- Five myths about variable selection by Georg Heinze and Daniela Dunkler
- Variable selection - A review and recommendations for the practicing statistician by Georg Heinze, Christine Wallisch, Daniela Dunkler
- Stopping stepwise by Peter Flom

Q

Some of the information in the data is spent on variable selection instead of using all information for estimation.

Model *specification* is preferred to model *selection*.

Information content of the data usually insufficient for reliable variable selection.

James Clerk Maxwell

Maxwell imagines one container divided into two parts, A and B. Both parts are filled with the same gas at equal temperatures and placed next to each other. Observing the molecules on both sides, an imaginary demon guards a trapdoor between the two parts. When a faster-than-average molecule from A flies towards the trapdoor, the demon opens it, and the molecule will fly from A to B. Likewise, when a slower-than-average molecule from B flies towards the trapdoor, the demon will let it pass from B to A. The average speed of the molecules in B will have increased while in A they will have slowed down on average. Since average molecular speed corresponds to temperature, the temperature decreases in A and increases in B, contrary to the second law of thermodynamics.

Szilárd pointed out that a real-life Maxwell’s demon would need to have some means of measuring molecular speed, and that the act of acquiring information would require an expenditure of energy. Since the demon and the gas are interacting, we must consider the total entropy of the gas and the demon combined. The expenditure of energy by the demon will cause an increase in the entropy of the demon, which will be larger than the lowering of the entropy of the gas.

Source: commons.wikimedia.org/wiki/File:YoungJamesClerkMaxwell.jpg

en.wikipedia.org/wiki/Maxwell’s_demon

Peter Ellis’ blog article contains excellent examples of issues discussed here but applied to time series modeling.

- Concerned with avoiding overfitting
- Assume typical problem in medicine, epidemiology, and the social sciences in which the signal:noise ratio is small (higher ratios allow for more aggressive modeling)
- \(p\) should be \(< \frac{m}{15}\) (F. E. Harrell et al., 1984, 1985; Peduzzi et al., 1995, 1996; Smith et al., 1992; van der Ploeg et al., 2014; Vittinghoff & McCulloch, 2006)
- \(p\) is number of parameters in full model or number of
*candidate*parameters in a stepwise analysis - Derived from simulations to find minimum sample size so that apparent discrimination = validated discrimination
- Applies to typical signal:noise ratios found outside of tightly controlled experiments
- If true \(R^{2}\) is high, many parameters can be estimated from smaller samples
- Ignores sample size needed just to estimate the intercept or, in semiparametric models, the underlying distribution function
^{3} - Riley, Snell, Ensor, Burke, Jr, et al. (2019) and Riley, Snell, Ensor, Burke, Harrell, et al. (2019) have refined sample size estimation for continuous, binary, and time-to-event models to account for all of these issues
- To just estimate \(\sigma\) in a linear model with a multiplicative margin of error of 1.2 with 0.95 confidence requires \(n=70\)

R

^{3} The sample size needed for these is model-dependent

S

Type of Response Variable | Limiting Sample Size \(m\) |
---|---|

Continuous | \(n\) (total sample size) |

Binary | \(3np(1-p), p=\frac{n_{2}}{n}\) ^{4} |

Ordinal (\(k\) categories) | \(n - \frac{1}{n^{2}} \sum_{i=1}^{k} n_{i}^{3}\) ^{5} |

Failure (survival) time | number of failures ^{6} |

^{4} If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is \(3 n_{1}n_{2}/n \approx 3 \min(n_{1}, n_{2})\) if \(\frac{n_{1}}{n}\) is near 0 or 1 (Whitehead, 1993, Eq. 10, 15). Here \(n_{1}\) and \(n_{2}\) are the marginal frequencies of the two response levels (Peduzzi et al., 1996). The effective sample size is a maximum (\(0.75n\)) when \(n_{1}=n_{2}\), i.e. \(p=\frac{1}{2}\) .

^{5} Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are \(n_{1}, \ldots, n_{k}\), compared with all cell sizes equal to unity (response is continuous) (Whitehead, 1993, Eq. 3). If all cell sizes are equal, the relative efficiency of having \(k\) response categories compared to a continuous response is \(1 - \frac{1}{k^{2}}\) (Whitehead, 1993, Eq. 14), e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.

^{6} This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional hazards methods such as Wilcoxon-type tests (Benedetti et al., 1982).

- Narrowly distributed predictor \(\rightarrow\) even higher \(n\)
- \(p\) includes
*all*variables screened for association with response, including interactions - Univariable screening (graphs, crosstabs, etc.)
**in no way**reduces multiple comparison problems of model building (Sun et al., 1996)

T

More About the Binary \(Y\) Case

To derive the effective sample size for binary \(Y\) we compare two samples of equal size \(\frac{n}{2}\) by testing whether the log odds ratio is zero. When the two samples come from populations with equal \(\Pr(Y=1) = p\), the variance of the log odds ratio is approximately \(2 \times \frac{1}{\frac{n}{2}p(1-p)} = \frac{4}{np(1-p)}\). When \(Y\) is continuous with no ties and a Wilcoxon or proportional odds model test is used, the reciprocal of the variance of the proportional odds model’s log odds ratio is \(\frac{n^{3}}{12 (n+1)^{2}} \times (1 - \frac{1}{n^{2}}) \approx \frac{n}{12}\) so the variance is approximately \(\frac{12}{n}\). The ratio of this to the binary \(Y\) variance is \(3p(1-p)\).

- Slope of calibration plot; regression to the mean
- Statistical estimation procedure — “pre-shrunk” models
- Aren’t regression coefficients OK because they’re unbiased?
- Problem is in how we use coefficient estimates
- Consider 20 samples of size \(n=50\) from \(U(0,1)\)
- Compute group means and plot in ascending order
- Equivalent to fitting an intercept and 19 dummies using least squares
- Result generalizes to general problems in plotting \(Y\) vs.

\(X\hat{\beta}\)

U

```
set.seed(123)
n <- 50
y <- runif(20*n)
group <- rep(1:20,each=n)
ybar <- tapply(y, group, mean)
ybar <- sort(ybar)
plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3,.7),
xlab='Group', ylab='Group Mean')
lines(1:20, ybar)
points(1:20, ybar, pch=20, cex=.5)
axis(2)
axis(1, at=1:20, labels=FALSE)
for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])
```

Prevent shrinkage by using pre-shrinkage

Spiegelhalter (1986): var. selection arbitrary, better prediction usually results from fitting all candidate variables and using shrinkage

Shrinkage closer to that expected from full model fit than based on number of significant variables (Copas, 1983)

Ridge regression (le Cessie & van Houwelingen, 1992; van Houwelingen & le Cessie, 1990)

Penalized MLE (Gray, 1992; Frank E. Harrell et al., 1998; Verweij & van Houwelingen, 1994)

Heuristic shrinkage parameter of van Houwelingen and le Cessie (van Houwelingen & le Cessie, 1990, Eq. 77) \[\hat{\gamma} = \frac{{\rm model}\ \chi^{2} - p}{{\rm model}\ \chi^{2}}, \tag{4.1}\]

OLS

^{7}: \(\hat{\gamma} = \frac{n-p-1}{n-1} R^{2}_{\rm adj} / R^{2}\)

\(R^{2}_{\rm adj} = 1 - (1 - R^{2})\frac{n-1}{n-p-1}\)\(p\) close to no. candidate variables

Copas (1983) (Eq. 8.5) adds 2 to numerator

- When at least 1 predictor can be predicted well from others
- Can be a blessing (data reduction, transformations)
- \(\uparrow\) s.e. of \(\hat{\beta}\), \(\downarrow\) power
- This is appropriate \(\rightarrow\) asking too much of the data (Chatterjee & Hadi, 2012, Chapter 9)
- Variables compete in variable selection, chosen one arbitrary
- Does not affect joint influence of a set of highly correlated variables (use multiple d.f. tests)
- Does not at all affect predictions on model construction sample
- Does not affect predictions on new data (Myers, 1990, pp. 379–381) if
- Extreme extrapolation not attempted
- New data have same type of collinearities as original data

- Example: LDL and total cholesterol – problem only if more inconsistent in new data
- Example: age and age\(^{2}\) – no problem
- One way to quantify for each predictor: variance inflation factors (VIF)
- General approach (maximum likelihood) — transform information matrix to correlation form, VIF=diagonal of inverse (Davis et al., 1986; Wax, 1992)
- See David A. Belsley (1991), pp. 28-30 for problems with VIF
- Easy approach: SAS
`VARCLUS`

procedure (Sarle, 1990),`R`

`varclus`

function, other clustering techniques: group highly correlated variables - Can score each group (e.g., first principal component, \(PC_1\) (D’Agostino et al., 1995)); summary scores not collinear

XYZ

- Unless \(n>>p\), model unlikely to validate
- Data reduction: \(\downarrow p\)
- Use the literature to eliminate unimportant variables.
- Eliminate variables whose distributions are too narrow.
- Eliminate candidate predictors that are missing in a large number of subjects, especially if those same predictors are likely to be missing for future applications of the model.
- Use a statistical data reduction method such as incomplete principal components regression, nonlinear generalizations of principal components such as principal surfaces, sliced inverse regression, variable clustering, or ordinary cluster analysis on a measure of similarity between variables.
- Data reduction is
**completely masked to \(Y\)**, which is precisely why it does not distort estimates, standard errors, \(P\)-values, or confidence limits - Data reduction =
*unsupervised learning* - Example: dataset with 40 events and 60 candidate predictors
- Use variable clustering to group variables by correlation structure
- Use clinical knowledge to refine the clusters
- Keep age and severity of disease as separate predictors because of their strength
- For others create clusters: socioeconomic, risk factors/history, and physiologic function
- Summarize each cluster with its first principal component \(PC_{1}\), i.e., the linear combination of characteristics that maximizes variance of the score across subjects subject to an overall constraint on the coefficients
- Fit outcome model with 5 predictors

- See this for useful resources related to principal component analysis of mixed continuous and categorical variables
- See this for an excellent discussion of whether data reduction (unsupervised learning) steps need to be included in resampling for model validation

A

- Remove variables that have poor distributions
- E.g., categorical variables with fewer than 2 categories having at least 20 observations

- Use flexible additive parametric additive models to determine how well each variable can be predicted from the remaining variables
- Variables dropped in stepwise fashion, removing the most predictable variable at each step
- Remaining variables used to predict
- Process continues until no variable still in the list of predictors can be predicted with an \(R^2\) or adjusted \(R^2\) greater than a specified threshold or until dropping the variable with the highest \(R^2\) (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at the threshold from the now smaller list of predictors
`R`

function`redun`

in`Hmisc`

package- Related to
*principal variables*(McCabe, 1984) but faster

B

- Goal: Separate variables into groups
- variables within group correlated with each other
- variables not correlated with non-group members

- Score each dimension, stop trying to separate effects of factors measuring same phenomenon
- Variable clustering (D’Agostino et al., 1995; Sarle, 1990) (oblique-rotation PC analysis) \(\rightarrow\) separate variables so that first PC is representative of group
- Can also do hierarchical cluster analysis on similarity matrix based on squared Spearman or Pearson correlations, or more generally, Hoeffding’s \(D\) (Hoeffding, 1948).
- See Guo et al. (2011) for a method related to variable clustering and sparse principal components.
- Chavent et al. (2012) implement many more variable clustering methods

C

Example: Figure 15.6

Reduce \(p\) by estimating transformations using associations with other predictors

Purely categorical predictors – correspondence analysis (Ciampi et al., 1986; Crichton & Hinde, 1989; Greenacre, 1988; Leclerc et al., 1988; Michailidis & de Leeuw, 1998)

Mixture of qualitative and continuous variables: qualitative principal components

Maximum total variance (MTV) of Young, Takane, de Leeuw (Michailidis & de Leeuw, 1998; Young et al., 1978)

- Compute \(PC_1\) of variables using correlation matrix
- Use regression (with splines, dummies, etc.) to predict \(PC_1\) from each \(X\) — expand each \(X_j\) and regress it separately on \(PC_1\) to get working transformations
- Recompute \(PC_1\) on transformed \(X\)s
- Repeat 3-4 times until variation explained by \(PC_1\) plateaus and transformations stabilize

Maximum generalized variance (MGV) method of Sarle (Kuhfeld, 2009, pp. 1267–1268)

- Predict each variable from (current transformations of) all other variables
- For each variable, expand it into linear and nonlinear terms or dummies, compute first canonical variate
- For example, if there are only two variables \(X_{1}\) and \(X_{2}\) represented as quadratic polynomials, solve for \(a,b,c,d\) such that \(aX_{1}+bX_{1}^{2}\) has maximum correlation with \(cX_{2}+dX_{2}^{2}\).
- Goal is to transform each var. so that it is most similar to predictions from other transformed variables
- Does not rely on PCs or variable clustering

MTV (PC-based instead of canonical var.) and MGV implemented in SAS

`PROC PRINQUAL`

(Kuhfeld, 2009)- Allows flexible transformations including monotonic splines
- Does not allow restricted cubic splines, so may be unstable unless monotonicity assumed
- Allows simultaneous imputation but often yields wild estimates

D

`R`

`transcan`

Function for Data Reduction & Imputation

Initialize missings to medians (or most frequent category)

Initialize transformations to original variables

Take each variable in turn as \(Y\)

Exclude obs. missing on \(Y\)

Expand \(Y\) (spline or dummy variables)

Score (transform \(Y\)) using first canonical variate

Missing \(Y\) \(\rightarrow\) predict canonical variate from \(X\)s

The imputed values can optionally be shrunk to avoid overfitting for small \(n\) or large \(p\)

Constrain imputed values to be in range of non-imputed ones

Imputations on original scale

- Continuous \(\rightarrow\) back-solve with linear interpolation
- Categorical \(\rightarrow\) classification tree (most freq. cat.) or match to category whose canonical score is closest to one predicted

Multiple imputation — bootstrap or approx. Bayesian bootstrap.

- Sample residuals multiple times (default \(M=5\))
- Are on “optimally” transformed scale
- Back-transform
`fit.mult.impute`

works with`aregImpute`

and`transcan`

output to easily get imputation-corrected variances and avg. \(\hat{\beta}\)

Option to insert constants as imputed values (ignored during transformation estimation); helpful when a lab value may be missing because the patient returned to normal

Imputations and transformed values may be easily obtained for new data

An

`R`

function`Function`

will create a series of`R`

functions that transform each predictorExample: \(n=415\) acutely ill patients

- Relate heart rate to mean arterial blood pressure
- Two blood pressures missing
- Heart rate not monotonically related to blood pressure
- See Figure 4.3 and Figure 4.4

E

```
Mean Arterial Blood Pressure Day 3
[1] 151 136
```

```
Convergence criterion:2.901 0.035 0.007
Convergence in 4 iterations
R-squared achieved in predicting each variable:
heart.rate blood.pressure
0.259 0.259
Adjusted R-squared:
heart.rate blood.pressure
0.254 0.253
```

```
400 401
132.4057 109.7741
```

ACE (Alternating Conditional Expectation) of L. Breiman & Friedman (1985)

- Uses nonparametric “super smoother” (Friedman, 1984)
- Allows monotonicity constraints, categorical vars.
- Does not handle missing data

F

These methods find

*marginal*transformationsCheck adequacy of transformations using \(Y\)

- Graphical
- Nonparametric smoothers (\(X\) vs. \(Y\))
- Expand original variable using spline, test additional predictive information over original transformation

G

- Try to score groups of transformed variables with \(PC_1\)
- Reduces d.f. by pre-transforming var. and by combining multiple var.
- Later may want to break group apart, but delete all variables in groups whose summary scores do not add significant information
- Sometimes simplify cluster score by finding a subset of its constituent variables which predict it with high \(R^2\).

H

Series of dichotomous variables:

- Construct \(X_1\) = 0-1 according to whether any variables positive
- Construct \(X_2\) = number of positives
- Test whether original variables add to \(X_1\) or \(X_2\)

I

Using Expected Shrinkage to Guide Data Reduction

- Fit full model with all candidates, \(p\) d.f., LR likelihood ratio \(\chi^2\)
- Compute \(\hat{\gamma}\)
- If \(<0.9\), consider shrunken estimator from whole model, or data reduction (again not using \(Y\))
- \(q\) regression d.f. for reduced model
- Assume best case: discarded dimensions had no association with \(Y\)
- Expected loss in LR is \(p-q\)
- New shrinkage \([{\rm LR} - (p-q) - q]/[{\rm LR} - (p-q)]\)
- Solve for \(q\) \(\rightarrow\) \(q \leq ({\rm LR}-p)/9\)
- Under these assumptions, no hope unless original LR \(> p+9\)
- No \(\chi^2\) lost by dimension reduction \(\rightarrow\) \(q \leq {\rm LR}/10\)

J

Example:

- Binary logistic model, 45 events on 150 subjects
- 10:1 rule \(\rightarrow\) analyze 4.5 d.f. total
- Analyst wishes to include age, sex, 10 others
- Not known if age linear or if age and sex additive
- 4 knots \(\rightarrow\) \(3+1+1\) d.f. for age and sex if restrict interaction to be linear
- Full model with 15 d.f. has LR=50
- Expected shrinkage factor \((50-15)/50 = 0.7\)
- LR\(>15+9=24\) \(\rightarrow\) reduction may help
- Reduction to \(q = (50-15)/9 \approx 4\) d.f. necessary
- Have to assume age linear, reduce other 10 to 1 d.f.
- Separate hypothesis tests intended \(\rightarrow\) use full model, adjust for multiple comparisons

K

Goals | Reasons | Methods |
---|---|---|

Group predictors so that each group represents a single dimension that can be summarized with a single score | • \(\downarrow\) d.f. arising from multiple predictors • Make \(PC_1\) more reasonable summary |
Variable clustering • Subject matter knowledge • Group predictors to maximize proportion of variance explained by \(PC_1\) of each group • Hierarchical clustering using a matrix of similarity measures between predictors |

Transform predictors | • \(\downarrow\) d.f. due to nonlinear and dummy variable components • Allows predictors to be optimally combined • Make \(PC_1\) more reasonable summary • Use in customized model for imputing missing values on each predictor |
• Maximum total variance on a group of related predictors • Canonical variates on the total set of predictors |

- Every observation should influence fit
- Major results should not rest on 1 or 2 obs.
- Overly infl. obs. \(\rightarrow\) \(\uparrow\) variance of predictions
- Also affects variable selection

L

Reasons for influence:

- Too few observations for complexity of model (see Section 4.7, Section 4.3)
- Data transcription or entry errors
- Extreme values of a predictor

M

- Sometimes subject so atypical should remove from dataset
- Sometimes truncate measurements where data density ends
- Example: \(n=4000\), 2000 deaths, white blood count range 500-100,000, .05,.95 quantiles=2755, 26700
- Linear spline function fit
- Sensitive to WBC\(>60000\ (n=16)\)
- Predictions stable if truncate WBC to 40000 (\(n=46\) above 40000)

- Disagreements between predictors and response. Ignore unless extreme values or another explanation
- Example: \(n=8000\), one extreme predictor value not on straight line relationship with other \((X,Y)\) \(\rightarrow\) \(\chi^{2}=36\) for \(H_{0}:\) linearity

Statistical Measures:

- Leverage: capacity to be influential (not necessarily infl.)

Diagonals of “hat matrix” \(H = X(X'X)^{-1}X'\) — measures how an obs. predicts its own response (D. A. Belsley et al., 1980) - \(h_{ii}>2(p+1)/n\) may signal a high leverage point (D. A. Belsley et al., 1980)
- DFBETAS: change in \(\hat{\beta}\) upon deletion of each obs, scaled by s.e.
- DFFIT: change in \(X\hat{\beta}\) upon deletion of each obs
- DFFITS: DFFIT standardized by s.e. of \(\hat{\beta}\)
- Some classify obs as overly influential when \(|{\rm DFFITS}|>2\sqrt{(p+1)/(n-p-1)}\) (D. A. Belsley et al., 1980)
- Others examine entire distribution for “outliers”
- No substitute for careful examination of data (Chatfield, 1991; Spence & Garrison, 1993)
- Maximum likelihood estimation requires 1-step approximations

N

- Level playing field (independent datasets, same no. candidate d.f., careful bootstrapping)
- Criteria:
- calibration
- discrimination
- face validity
- measurement errors in required predictors
- use of continuous predictors (which are usually better defined than categorical ones)
- omission of “insignificant” variables that nonetheless make sense as risk factors
- simplicity (though this is less important with the availability of computers)
- lack of fit for specific types of subjects

- Goal is to rank-order: ignore calibration
- Otherwise, dismiss a model having poor calibration
- Good calibration \(\rightarrow\) compare discrimination (e.g., \(R^{2}\) (Nagelkerke, 1991), model \(\chi^2\), Somers’ \(D_{xy}\), Spearman’s \(\rho\), area under ROC curve)
- Worthwhile to compare models on a measure not used to optimize either model, e.g., mean absolute error, median absolute error if using OLS
- Rank measures may not give enough credit to extreme predictions \(\rightarrow\) model \(\chi^{2}, R^{2}\), examine extremes of distribution of \(\hat{Y}\)
- Examine differences in predicted values from the two models
- See (Peek et al., 2007; Pencina et al., 2008, 2011, 2012) for discussions and examples of low power for testing differences in ROC areas, and for other approaches.

OPQ

See also Section 5.6

Greenland (2000) discusses many important points:

- Stepwise variable selection on confounders leaves important confounders uncontrolled
- Shrinkage is far superior to variable selection
- Variable selection does more damage to confidence interval widths than to point estimates
- Claims about unbiasedness of ordinary MLEs are misleading because they assume the model is correct and is the only model entertained
- “models need to be complex to capture uncertainty about the relations … an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an anti-parsimony principle often attributed to L. J. Savage ‘All models should be as big as an elephant’ (see Draper, 1995)”

R

Greenland’s example of inadequate adjustment for confounders as a result of using a bad modeling strategy:

- Case-control study of diet, food constituents, breast cancer
- 140 cases, 222 controls
- 35 food constituent intakes and 5 confounders
- Food intakes are correlated
- Traditional stepwise analysis not adjusting simultaneously for all foods consumed \(\rightarrow\) 11 foods had \(P < 0.05\)
- Full model with all 35 foods competing \(\rightarrow\) 2 had \(P < 0.05\)
- Rigorous simultaneous analysis (hierarchical random slopes model) penalizing estimates for the number of associations examined \(\rightarrow\) no foods associated with breast cancer

S

**Global Strategies**

- Use a method known not to work well (e.g., stepwise variable selection without penalization; recursive partitioning), document how poorly the model performs (e.g. using the bootstrap), and use the model anyway
- Develop a black box model that performs poorly and is difficult to interpret (e.g., does not incorporate penalization)
- Develop a black box model that performs well and is difficult to interpret
- Develop interpretable approximations to the black box
- Develop an interpretable model (e.g. give priority to additive effects) that performs well and is likely to perform equally well on future data from the same stream

T

**Preferred Strategy in a Nutshell**

- Decide how many d.f. can be spent
- Decide where to spend them
- Spend them
- Don’t reconsider, especially if inference needed

U

- Assemble accurate, pertinent data and lots of it, with wide distributions for \(X\).
- Formulate good hypotheses — specify relevant candidate predictors and possible interactions. Don’t use \(Y\) to decide which \(X\)’s to include.
- Characterize subjects with missing \(Y\). Delete such subjects in rare circumstances (Crawford et al., 1995). For certain models it is effective to multiply impute \(Y\).
- Characterize and impute missing \(X\). In most cases use multiple imputation based on \(X\) and \(Y\)
- For each predictor specify complexity or degree of nonlinearity that should be allowed (more for important predictors or for large \(n\)) (Section 4.1)
- Do data reduction if needed (pre-transformations, combinations), or use penalized estimation (Frank E. Harrell et al., 1998)
- Use the entire sample in model development
- Can do highly structured testing to simplify “initial” model
- Test entire group of predictors with a single \(P\)-value
- Make each continuous predictor have same number of knots, and select the number that optimizes AIC
- Test the combined effects of all nonlinear terms with a single \(P\)-value

- Make tests of linearity of effects in the model only to demonstrate to others that such effects are often statistically significant. Don’t remove individual insignificant effects from the model.
- Check additivity assumptions by testing pre-specified interaction terms. Use a global test and either keep all or delete all interactions.
- Check to see if there are overly-influential observations.
- Check distributional assumptions and choose a different model if needed.
- Do limited backwards step-down variable selection if parsimony is more important that accuracy (Spiegelhalter, 1986). But confidence limits, etc., must account for variable selection (e.g., bootstrap).
- This is the “final” model.
- Interpret the model graphically and by computing predicted values and appropriate test statistics. Compute pooled tests of association for collinear predictors.
- Validate this model for calibration and discrimination ability, preferably using bootstrapping.
- Shrink parameter estimates if there is overfitting but no further data reduction is desired (unless shrinkage built-in to estimation)
- When missing values were imputed, adjust final variance-covariance matrix for imputation. Do this as early as possible because it will affect other findings.
- When all steps of the modeling strategy can be automated, consider using Faraway (1992) to penalize for the randomness inherent in the multiple steps.
- Develop simplifications to the final model as needed.

V

- Less need for parsimony; even less need to remove insignificant variables from model (otherwise CLs too narrow)
- Careful consideration of interactions; inclusion forces estimates to be conditional and raises variances
- If variable of interest is mostly the one that is missing, multiple imputation less valuable
- Complexity of main variable specified by prior beliefs, compromise between variance and bias
- Don’t penalize terms for variable of interest
- Model validation less necessary
- Like the hypothesis testing scenario covered next, when one is attempting to apply a causal interpretation to the effect being estimated, consideration of pathways and confounding are all-important. Collider bias and backdoor pathways can ruin interpretation, and so can the omission of confounder variables that would have provided alternate explanations for observed effects of interest. In pure prediction mode we don’t care very much how we got here, but in isolating the effect of a specific variable in the model, e.g., in estimating the effect of treatment in an observational treatment comparison study, measuring and adjusting for nearly all confounding in play are very important. Sometimes adjusting for confounding requires overfitting the model from a prediction standpoint, but that is OK. Overfitting can result in widering confidence intervals for the effect of interest, and that is a proper price to pay.

W

- Virtually same as previous strategy
- Interactions require tests of effect by varying values of another variable, or “main effect + interaction” joint tests (e.g., is treatment effective for either sex, allowing effects to be different)
- Validation may help quantify over-adjustment

X

**Section 4.1**

- An investigator is interested in finding risk factors for stroke. She has collected data on 3000 subjects of whom 70 were known to suffer a stroke within 5 years from study entry. She has decided that race (5 possible levels), age, height, and mean arterial blood pressure should definitely be in the model. She does not try to find previous studies that show how these potential risk factors relate to the risk of stroke. What is one way to decide how many degrees of freedom to spend on each predictor? How can one achieve these d.f. (whatever they are) for race and for a continuous variable?
- What is the driving force behind the approach used here to prespecify the complexity with which each predictor is specified in the regression model?
- Why is the process fair?
- When feasible to do, why is learning from a saturated model preferred?

**Section 4.3**

- What did we already study that variable selection is an extension of?
- What does variable selection ruin more than it ruins regression coefficient estimates?
- Besides those problems what’s the worst thing you can say about statistical test-based variable selection?
- In which situation would variable selection have a chance to be helpful?
- What is the statistical point that the Maxwell demon analogy is trying to bring out?
- What is the single root cause of problems that is in common to each of the following strategies?
- The investigator retains race as 5 categories and fits it with 4 dummy variables. She examines coefficients from these dummy variables to decide which races have similar coefficients and should be pooled.
- All continuous variables are allowed to be nonlinear using splines, and variables having insignificant nonlinear terms are re-fitted as linear.
- All potential risk factors are used as candidate predictor variables with forward stepwise variable selection with a significance level for entering the model of \(\alpha=0.1\).
- A clinical trial of 5 treatments find that 3 of the treatments result in nearly the same blood pressure reduction. These 3 treatments are pooled and their overall mean is compared with each of the remaining 2 treatments.

- Name at least one specific harm done by the strategies outlined in the last question, in general terms (not specific to each strategy).

**Section 4.4**

- Describe a major way that “breaking ties” in Y, i.e., having it more continuous, results in more power and precision in modeling.
- What is the effective sample size in a study with 10 cancer outcomes and 100,000 normal outcomes?

**Section 4.5**

- What are main causes of passive shrinkage/regression to the mean?
- Why does the heuristic shrinkage estimate work?
- What causes the graph of \(\hat{Y}\) (on \(x\)-axis) vs. \(Y\) (on \(y\)-axis; you could call this \(Y_{new}\)) to be flatter than a \(45^\circ\) line when the graph is made for new data not used to fit the model?

**Section 4.6**

- What causes the standard error of a regression coefficient to get large when adjusting for a competing predictor?
- When just wanting to measure strength of association or test a group of predictors, what is a simple fix for co-linearity?

**Section 4.7**

- What is the purpose of data reduction, and why does it not create phantom degrees of freedom that need to be accounted for later in the modeling process?
- Name several data reduction methods for reducing the number of regression coefficients to estimate.
- Principal components can have unstable loadings (e.g., if you bootstrap). Why are they still a valuable component in a data reduction strategy?
- What is the overall guidance for knowing how much unsupervised learning (data reduction) to do?

**Section 4.9**

- Why are high leverage (far out) observations influential on regression coefficients even when Y is binary?

**Section 4.10**

- What would make one model having much better predictive discrimination than another irrelevant?

**Section 4.12**

- Are the steps to the default strategy complete and is the sequencing “correct”?

Altman, D. G., & Andersen, P. K. (1989). Bootstrap investigation of the stability of a Cox regression model. *Stat Med*, *8*, 771–783.

Atkinson, A. C. (1980). A note on the generalized information criterion for choice of a model. *Biometrika*, *67*, 413–418.

Austin, P. C. (2008). Bootstrap model selection had similar performance for selecting authentic and noise variables compared to backward variable elimination: A simulation study. *J Clin Epi*, *61*, 1009–1017.

"in general, a bootstrap model selection method had comparable performance to conventional backward variable elimination for identifying the true regression model. In most settings, both methods performed poorly at correctly identifying the correct regression model."

Belsley, David A. (1991). *Conditioning Diagnostics: Collinearity and Weak Data in Regression*. Wiley.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). *Regression Diagnostics: Identifying Influential Data and Sources of Collinearity*. Wiley.

Benedetti, J. K., Liu, P.-Y., Sather, H. N., Seinfeld, J., & Epton, M. A. (1982). Effective sample size for tests of censored survival data. *Biometrika*, *69*, 343–349.

Breiman, Leo. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. *J Am Stat Assoc*, *87*, 738–754.

Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). *J Am Stat Assoc*, *80*, 580–619.

Chatfield, C. (1991). Avoiding statistical pitfalls (with discussion). *Stat Sci*, *6*, 240–268.

Chatfield, C. (1995). Model uncertainty, data mining and statistical inference (with discussion). *J Roy Stat Soc A*, *158*, 419–466.

bias by selecting model because it fits the data well; bias in standard errors;P. 420: ... need for a better balance in the literature and in statistical teaching between techniques and problem solving strategies. P. 421: It is “well known” to be “logically unsound and practically misleading” (Zhang, 1992) to make inferences as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a “quiet scandal”), they (we) continue to ignore the difficulties because it is not clear what else could or should be done. P. 421: Estimation errors for regression coefficients are usually smaller than errors from failing to take into account model specification. P. 422: Statisticians must stop pretending that model uncertainty does not exist and begin to find ways of coping with it. P. 426: It is indeed strange that we often admit model uncertainty by searching for a best model but then ignore this uncertainty by making inferences and predictions as if certain that the best fitting model is actually true. P. 427: The analyst needs to assess the model selection process and not just the best fitting model. P. 432: The use of subset selection methods is well known to introduce alarming biases. P. 433: ... the AIC can be highly biased in data-driven model selection situations. P. 434: Prediction intervals will generally be too narrow. In the discussion, Jamal R. M. Ameen states that a model should be (a) satisfactory in performance relative to the stated objective, (b) logically sound, (c) representative, (d) questionable and subject to on-line interrogation, (e) able to accommodate external or expert information and (f) able to convey information.

Chatterjee, S., & Hadi, A. S. (2012). *Regression Analysis by Example* (Fifth). Wiley.

Chavent, M., Kuentz-Simonet, V., Liquet, B., & Saracco, J. (2012). ClustOfVar: An R package for the clustering of variables. *J Stat Software*, *50*(13), 1–16.

Ciampi, A., Thiffault, J., Nakache, J. P., & Asselain, B. (1986). Stratification by stepwise regression, correspondence analysis and recursive partition. *Comp Stat Data Analysis*, *1986*, 185–204.

Copas, J. B. (1983). Regression, prediction and shrinkage (with discussion). *J Roy Stat Soc B*, *45*, 311–354.

Crawford, S. L., Tennstedt, S. L., & McKinlay, J. B. (1995). A comparison of analytic methods for non-random missingness of outcome data. *J Clin Epi*, *48*, 209–219.

Crichton, N. J., & Hinde, J. P. (1989). Correspondence analysis as a screening method for indicants for clinical diagnosis. *Stat Med*, *8*, 1351–1362.

D’Agostino, R. B., Belanger, A. J., Markson, E. W., Kelly-Hayes, M., & Wolf, P. A. (1995). Development of health risk appraisal functions in the presence of multiple indicators: The Framingham Study nursing home institutionalization model. *Stat Med*, *14*, 1757–1770.

Davis, C. E., Hyde, J. E., Bangdiwala, S. I., & Nelson, J. J. (1986). An example of dependencies among variables in a conditional logistic regression. In S. H. Moolgavkar & R. L. Prentice (Eds.), *Modern Statistical Methods in Chronic Disease Epidemiology* (pp. 140–147). Wiley.

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. *British J Math Stat Psych*, *45*, 265–282.

Faraway, J. J. (1992). The cost of data analysis. *J Comp Graph Stat*, *1*, 213–229.

Friedman, J. H. (1984). *A variable span smoother* (Technical Report No. 5). Laboratory for Computational Statistics, Department of Statistics, Stanford University.

Giudice, J. H., Fieberg, J. R., & Lenarz, M. S. (2011). Spending degrees of freedom in a poor economy: A case study of building a sightability model for moose in northeastern minnesota. *J Wildlife Manage*. https://doi.org/10.1002/jwmg.213

Grambsch, P. M., & O’Brien, P. C. (1991). The effects of transformations and preliminary tests for non-linearity in regression. *Stat Med*, *10*, 697–709.

Gray, R. J. (1992). Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. *J Am Stat Assoc*, *87*, 942–951.

Greenacre, M. J. (1988). Correspondence analysis of multivariate categorical data by weighted least-squares. *Biometrika*, *75*, 457–467.

Greenland, S. (2000). When should epidemiologic regressions use random coefficients? *Biometrics*, *56*, 915–921. https://doi.org/10.1111/j.0006-341X.2000.00915.x

use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important confounders uncontrolled;composition matrix;example with far too many significant predictors with many regression coefficients absurdly inflated when overfit;lack of evidence for dietary effects mediated through constituents;shrinkage instead of variable selection;larger effect on confidence interval width than on point estimates with variable selection;uncertainty about variance of random effects is just uncertainty about prior opinion;estimation of variance is pointless;instead the analysis should be repeated using different values;"if one feels compelled to estimate $\tau{̂2}$, I would recommend giving it a proper prior concentrated amount contextually reasonable values";claim about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained;shrinkage towards compositional model;"models need to be complex to capture uncertainty about the relations...an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J. Savage ’All models should be as big as an elephant (see Draper, 1995)’". See also gus06per.

Guo, J., James, G., Levina, E., Michailidis, G., & Zhu, J. (2011). Principal component analysis with sparse fused loadings. *J Comp Graph Stat*, *19*(4), 930–946.

incorporates blocking structure in the variables;selects different variables for different components;encourages loadings of highly correlated variables to have same magnitude, which aids in interpretation

Harrell, F. E. (1986). The LOGIST Procedure. In *SUGI Supplemental Library Users Guide* (Version 5, pp. 269–293). SAS Institute, Inc.

Harrell, F. E., Lee, K. L., Califf, R. M., Pryor, D. B., & Rosati, R. A. (1984). Regression modeling strategies for improved prognostic prediction. *Stat Med*, *3*, 143–152.

Harrell, Frank E., Lee, K. L., & Mark, D. B. (1996). Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. *Stat Med*, *15*, 361–387.

Harrell, F. E., Lee, K. L., Matchar, D. B., & Reichert, T. A. (1985). Regression models for prognostic prediction: Advantages, problems, and suggested solutions. *Ca Trt Rep*, *69*, 1071–1077.

Harrell, Frank E., Margolis, P. A., Gove, S., Mason, K. E., Mulholland, E. K., Lehmann, D., Muhe, L., Gatchalian, S., & Eichenwald, H. F. (1998). Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. *Stat Med*, *17*, 909–944. http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract

Hoeffding, W. (1948). A non-parametric test of independence. *Ann Math Stat*, *19*, 546–557.

Hurvich, C. M., & Tsai, C. L. (1990). The impact of model selection on inference in linear regression. *Am Statistician*, *44*, 214–217.

Knaus, W. A., Harrell, F. E., Lynn, J., Goldman, L., Phillips, R. S., Connors, A. F., Dawson, N. V., Fulkerson, W. J., Califf, R. M., Desbiens, N., Layde, P., Oye, R. K., Bellamy, P. E., Hakim, R. B., & Wagner, D. P. (1995). The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. *Ann Int Med*, *122*, 191–203. https://doi.org/10.7326/0003-4819-122-3-199502010-00007

Kuhfeld, W. F. (2009). The PRINQUAL Procedure. In *SAS/STAT 9.2 User’s Guide* (Second). SAS Publishing. http://support.sas.com/documentation/onlinedoc/stat

Lawless, J. F., & Singhal, K. (1978). Efficient screening of nonnormal regression models. *Biometrics*, *34*, 318–327.

le Cessie, S., & van Houwelingen, J. C. (1992). Ridge estimators in logistic regression. *Appl Stat*, *41*, 191–201.

Leclerc, A., Luce, D., Lert, F., Chastang, J. F., & Logeay, P. (1988). Correspondence analysis and logistic modelling: Complementary use in the analysis of a health survey among nurses. *Stat Med*, *7*, 983–995.

Lockhart, R., Taylor, J., Tibshirani, R. J., & Tibshirani, R. (2013). *A significance test for the lasso*. arXiv. http://arxiv.org/abs/1301.7161

Mantel, N. (1970). Why stepdown procedures in variable selection. *Technometrics*, *12*, 621–625.

McCabe, G. P. (1984). Principal variables. *Technometrics*, *26*, 137–144.

Michailidis, G., & de Leeuw, J. (1998). The Gifi system of descriptive multivariate analysis. *Stat Sci*, *13*, 307–336.

Myers, R. H. (1990). *Classical and Modern Regression with Applications*. PWS-Kent.

Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. *Biometrika*, *78*, 691–692.

Nick, T. G., & Hardin, J. M. (1999). Regression modeling strategies: An illustrative case study from medical rehabilitation outcomes research. *Am J Occ Ther*, *53*, 459–470.

Peduzzi, P., Concato, J., Feinstein, A. R., & Holford, T. R. (1995). Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. *J Clin Epi*, *48*, 1503–1510.

Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). A simulation study of the number of events per variable in logistic regression analysis. *J Clin Epi*, *49*, 1373–1379.

Peek, N., Arts, D. G. T., Bosman, R. J., van der Voort, P. H. J., & de Keizer, N. F. (2007). External validation of prognostic models for critically ill patients required substantial sample sizes. *J Clin Epi*, *60*, 491–501.

large sample sizes need to obtain reliable external validations;inadequate power of DeLong, DeLong, and Clarke-Pearson test for differences in correlated ROC areas (p. 498);problem with tests of calibration accuracy having too much power for large sample sizes

Pencina, M. J., D’Agostino, R. B., & Demler, O. V. (2012). Novel metrics for evaluating improvement in discrimination: Net reclassification and integrated discrimination improvement for normal variables and nested models. *Stat Med*, *31*(2), 101–113. https://doi.org/10.1002/sim.4348

Pencina, M. J., D’Agostino, R. B., & Steyerberg, E. W. (2011). Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. *Stat Med*, *30*, 11–21. https://doi.org/10.1002/sim.4085

lack of need for NRI to be category-based;arbitrariness of categories;"category-less or continuous NRI is the most objective and versatile measure of improvement in risk prediction;authors misunderstood the inadequacy of three categories if categories are used;comparison of NRI to change in C index;example of continuous plot of risk for old model vs. risk for new model

Pencina, M. J., D’Agostino Sr, R. B., D’Agostino Jr, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. *Stat Med*, *27*, 157–172.

small differences in ROC area can still be very meaningful;example of insignificant test for difference in ROC areas with very significant results from new method;Yates’ discrimination slope;reclassification table;limiting version of this based on whether and amount by which probabilities rise for events and lower for non-events when compare new model to old;comparing two models;see letter to the editor by Van Calster and Van Huffel, Stat in Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med 31:93-97, 2012

Riley, R. D., Snell, K. I., Ensor, J., Burke, D. L., Harrell, F. E., Moons, K. G., & Collins, G. S. (2019). Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. *Stat Med*, *38*(7), 1276–1296. https://doi.org/10.1002/sim.7992

Riley, R. D., Snell, K. I., Ensor, J., Burke, D. L., Jr, F. E. H., Moons, K. G., & Collins, G. S. (2019). Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. *Statistics in Medicine*, *38*(7), 1276–1296. https://doi.org/10.1002/sim.7992

Roecker, E. B. (1991). Prediction error and its estimation for subset-selected models. *Technometrics*, *33*, 459–468.

Sarle, W. (1990). The VARCLUS Procedure. In *SAS/STAT User’s Guide* (fourth, Vol. 2, pp. 1641–1659). SAS Institute, Inc. http://support.sas.com/documentation/onlinedoc/stat

Sauerbrei, W., & Schumacher, M. (1992). A bootstrap resampling procedure for model building: Application to the Cox regression model. *Stat Med*, *11*, 2093–2109.

Smith, L. R., Harrell, F. E., & Muhlbaier, L. H. (1992). Problems and potentials in modeling survival. In M. L. Grady & H. A. Schwartz (Eds.), *Medical Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056* (pp. 151–159). US Dept. of Health and Human Services, Agency for Health Care Policy and Research. https://hbiostat.org/bib/papers/smi92pro.pdf

Spence, I., & Garrison, R. F. (1993). A remarkable scatterplot. *Am Statistician*, *47*, 12–19.

Spiegelhalter, D. J. (1986). Probabilistic prediction in patient management and clinical trials. *Stat Med*, *5*, 421–433. https://doi.org/10.1002/sim.4780050506

z-test for calibration inaccuracy (implemented in Stata, and R Hmisc package’s val.prob function)

Steyerberg, E. W., Eijkemans, M. J. C., Harrell, F. E., & Habbema, J. D. F. (2001). Prognostic modeling with logistic regression analysis: In search of a sensible strategy in small data sets. *Med Decis Mak*, *21*, 45–56.

Sun, G.-W., Shook, T. L., & Kay, G. L. (1996). Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. *J Clin Epi*, *49*, 907–916.

van der Ploeg, T., Austin, P. C., & Steyerberg, E. W. (2014). Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. *BMC Medical Research Methodology*, *14*(1), 137+. https://doi.org/10.1186/1471-2288-14-137

Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed to final discrimination measure. But much good practical information. Recursive partitioning fared poorly.

van Houwelingen, J. C., & le Cessie, S. (1990). Predictive value of statistical models. *Stat Med*, *9*, 1303–1325.

Verweij, P. J., & van Houwelingen, H. C. (1994). Penalized likelihood in Cox regression. *Stat Med*, *13*, 2427–2436.

Vittinghoff, E., & McCulloch, C. E. (2006). Relaxing the rule of ten events per variable in logistic and Cox regression. *Am J Epi*, *165*, 710–718.

the authors may have not been quite stringent enough in their assessment of adequacy of predictions;letter to the editor submitted

Wax, Y. (1992). Collinearity diagnosis for a relative risk regression analysis: An application to assessment of diet-cancer relationship in epidemiological studies. *Stat Med*, *11*, 1273–1287.

Whitehead, J. (1993). Sample size calculations for ordered categorical data. *Stat Med*, *12*, 2257–2271.

Wiegand, R. E. (2010). Performance of using multiple stepwise algorithms for variable selection. *Stat Med*, *29*, 1647–1659.

fruitless to try different stepwise methods and look for agreement;the methods will agree on the wrong model

Young, F. W., Takane, Y., & de Leeuw, J. (1978). The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features. *Psychometrika*, *43*, 279–281.

```
```{r include=FALSE}
options(qproject='rms', prType='html')
require(Hmisc)
require(qreport)
getRs('qbookfun.r')
hookaddcap()
knitr::set_alias(w = 'fig.width', h = 'fig.height', cap = 'fig.cap', scap ='fig.scap')
```
# Multivariable Modeling Strategies {#sec-multivar}
`r mrg(sound("mv-1"))`
* "Spending d.f.": examining or fitting parameters in models, or `r ipacue()`
examining tables or graphs that utilize $Y$ to tell you how to
model variables
* If wish to preserve statistical properties, can't retrieve d.f.
once they are "spent" (see Grambsch \& O'Brien)
* If a scatterplot suggests linearity and you fit a linear model,
how many d.f. did you actually spend (i.e., the d.f. that when put
into a formula results in accurate confidence limits or $P$-values)?
* Decide number of d.f. that can be spent `r ipacue()`
* Decide where to spend them
* Spend them
* General references: [@nic99reg; @ste01pro; @har96mul; @giu11spe]
There are many choices to be made when deciding upon a global modeling
strategy, including choice between
* parametric and nonparametric procedures `r ipacue()`
* parsimony and complexity
* parsimony and good discrimination ability
* interpretable models and black boxes.
## Prespecification of Predictor Complexity Without Later Simplification {#sec-multivar-complexity}
`r mrg(sound("mv-2"))`
* Rarely expect linearity `r ipacue()`
* Can't always use graphs or other devices to choose
transformation
* If select from among many transformations, results biased
* Need to allow flexible nonlinearity to potentially strong
predictors not _known_ to predict linearly
* Once decide a predictor is "in" can choose no. of
parameters to devote to it using a general association index
with $Y$
* Need a measure of "potential predictive punch"
* Measure needs to mask analyst to true form of regression to
preserve statistical properties
Motivating examples:
```{r tradeoffs,h=6,w=6}
#| label: fig-multivar-fitting-errors
#| fig-cap: "Fitting errors to withstand or to avoid"
# Overfitting a flat relationship
require(rms)
set.seed(1)
x <- runif(1000)
y <- runif(1000, -0.5, 0.5)
dd <- datadist(x, y); options(datadist='dd')
par(mfrow=c(2,2), mar=c(2, 2, 3, 0.5))
pp <- function(actual) {
yhat <- predict(f, data.frame(x=xs))
yreal <- actual(xs)
plot(0, 0, xlim=c(0,1),
ylim=range(c(quantile(y, c(0.1, 0.9)), yhat,
yreal)),
type='n', axes=FALSE)
axis(1, labels=FALSE); axis(2, labels=FALSE)
lines(xs, yreal)
lines(xs, yhat, col='blue')
}
f <- ols(y ~ rcs(x, 5))
xs <- seq(0, 1, length=150)
pp(function(x) 0*x)
title('Mild Error:\nOverfitting a Flat Relationship',
cex=0.5)
y <- x + runif(1000, -0.5, 0.5)
f <- ols(y ~ rcs(x, 5))
pp(function(x) x)
title('Mild Error:\nOverfitting a Linear Relationship',
cex=0.5)
y <- x^4 + runif(1000, -1, 1)
f <- ols(y ~ x)
pp(function(x) x^4)
title('Serious Error:\nUnderfitting a Steep Relationship',
cex=0.5)
y <- - (x - 0.5) ^ 2 + runif(1000, -0.2, 0.2)
f <- ols(y ~ x)
pp(function(x) - (x - 0.5) ^ 2)
title('Tragic Error:\nMonotonic Fit to\nNon-Monotonic Relationship',
cex=0.5)
```
| | |
|----------|-----------|
| Categorical predictor with $k$ levels | Collapse less frequent categories into "other" |
| Continuous predictor represented as $k$-knot restricted cubic spline | Reduce $k$ to a number as low as 3, or 0 (linear) |
: Examples of Reducing the Number of Parameters
`r ipacue()`
### Learning From a Saturated Model
When the effective sample size available is sufficiently large so that
a saturated main effects model may be fitted, a good approach to
gauging predictive potential is the following.
* Let all continuous predictors be represented as restricted cubic `r ipacue()`
splines with $k$ knots, where $k$ is the maximum number of knots the
analyst entertains for the current problem.
* Let all categorical predictors retain their original categories
except for pooling of very low prevalence categories (e.g., ones
containing $<6$ observations).
* Fit this general main effects model.
* Compute the partial $\chi^2$ statistic for testing the
association of each predictor with the response, adjusted for all
other predictors. In the case of ordinary regression convert
partial $F$ statistics to $\chi^2$ statistics or partial $R^2$
values.
* Make corrections for chance associations to "level the playing
field" for predictors having greatly varying d.f., e.g., subtract
the d.f. from the partial $\chi^2$ (the expected value of
$\chi^{2}_{p}$ is $p$ under $H_{0}$).
* Make certain that tests of nonlinearity are not revealed as this
would bias the analyst.
* Sort the partial association statistics in descending order.
Commands in the `rms` package can be used to plot only what is
needed. Here is an example for a logistic model.
```{r eval=FALSE}
f <- lrm(y ~ sex + race + rcs(age,5) + rcs(weight,5) +
rcs(height,5) + rcs(blood.pressure,5))
plot(anova(f))
```
### Using Marginal Generalized Rank Correlations
When collinearities or confounding are not problematic, a quicker
approach based on pairwise measures of association can be useful.
This approach will not have numerical problems (e.g., singular
covariance matrix) and is based on:
* 2 d.f. generalization of Spearman $\rho$---$R^2$ based on `r ipacue()`
$rank(X)$ and $rank(X)^2$ vs. $rank(Y)$
* $\rho^2$ can detect U-shaped relationships
* For categorical $X$, $\rho^2$ is $R^2$ from dummy variables
regressed against $rank(Y)$; this is tightly related to the
Wilcoxon-Mann-Whitney-Kruskal-Wallis rank test for group
differences^[This test statistic does not inform the analyst of _which_ groups are different from one another.]
* Sort variables by descending order of $\rho^2$
* Specify number of knots for continuous $X$, combine infrequent
categories of categorical $X$ based on $\rho^2$
Allocating d.f. based on partial tests of association or sorting
$\rho^2$ is a fair procedure because
* We already decided to keep variable in model no matter what `r ipacue()`
$\rho^2$ or $\chi^2$ values are seen
* $\rho^2$ and $\chi^2$ do not reveal degree of nonlinearity;
high value may be due solely to strong linear effect
* low $\rho^2$ or $\chi^2$ for a categorical variable might lead to
collapsing the most disparate categories
Initial simulations show the procedure to be conservative. Note that
one can move from simpler to more complex models but not the other
way round
## Checking Assumptions of Multiple Predictors Simultaneously
* Sometimes failure to adjust for other variables gives wrong `r ipacue()`
transformation of an $X$, or wrong significance of interactions
* Sometimes unwieldy to deal simultaneously with all predictors
at each stage $\rightarrow$ assess regression assumptions separately
for each predictor
## Variable Selection {#sec-multivar-variable-selection}
`r mrg(sound("mv-3"))`
* Series of potential predictors with no prior knowledge `r ipacue()`
* $\uparrow$ exploration $\rightarrow$ $\uparrow$ shrinkage (overfitting)
* Summary of problem: $E(\hat{\beta} | \hat{\beta}$
"significant" $) \neq \beta$ [@cha95mod]
* Biased $R^2$, $\hat{\beta}$, standard errors, $P$-values too small
* $F$ and $\chi^2$ statistics do not have the claimed
distribution^[@loc13sig provide an example with $n=100$ and 10 orthogonal predictors where all true $\beta$s are zero. The test statistic for the first variable to enter has type I assertion probability of 0.39 when the nominal $\alpha$ is set to 0.05.] [@gra91]
* Will result in residual confounding if use variable selection
to find confounders [@gre00whe]
* @der92bac found that in stepwise analyses the final model represented noise 0.20-0.74 of time, final model usually contained $< \frac{1}{2}$ actual number of authentic predictors. Also: `r ipacue()`
>> 1. The degree of correlation between the predictor variables affected
the frequency with which authentic predictor variables found their way
into the final model.
>> 2. The number of candidate predictor variables affected the number of
noise variables that gained entry to the model.
>> 3. The size of the sample was of little practical importance in
determining the number of authentic variables contained in the final
model.
>> 4. The population multiple coefficient of determination could be
faithfully estimated by adopting a statistic that is adjusted by
the total number of candidate predictor variables rather than the
number of variables in the final model.
* Global test with $p$ d.f. insignificant $\rightarrow$ **stop**
Simulation experiment, true $\sigma^{2} = 6.25$, 8 candidate `r ipacue()`
variables, 4 of them related to $Y$ in the population. Select best
model using all possible subsets regression to maximize $R^{2}_{adj}$ (all possible subsets is not usually recommended but gives variable selection more of a chance to work in this context).
`r mrg(sound("mv-4"))`
**Note**: The audio was made using stepAIC with collinearities in predictors. The code below allows for several options. Here we use all possible subsets of predictors and force predictors to be uncorrelated, which is the easiest case for variable selection.
```{r}
require(MASS) # provides stepAIC function
require(leaps) # provides regsubsets function
sim <- function(n, sigma=2.5, method=c('stepaic', 'leaps'),
pr=FALSE, prcor=FALSE, dataonly=FALSE) {
method <- match.arg(method)
if(uncorrelated) {
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)
x6 <- rnorm(n)
x7 <- rnorm(n)
x8 <- rnorm(n)
}
else {
x1 <- rnorm(n)
x2 <- x1 + 2.0 * rnorm(n) # was + 0.5 * rnorm(n)
x3 <- rnorm(n)
x4 <- x3 + 1.5 * rnorm(n)
x5 <- x1 + rnorm(n)/1.3
x6 <- x2 + 2.25 * rnorm(n) # was rnorm(n)/1.3
x7 <- x3 + x4 + 2.5 * rnorm(n) # was + rnorm(n)
x8 <- x7 + 4.0 * rnorm(n) # was + 0.5 * rnorm(n)
}
z <- cbind(x1,x2,x3,x4,x5,x6,x7,x8)
if(prcor) return(round(cor(z), 2))
lp <- x1 + x2 + .5*x3 + .4*x7
y <- lp + sigma*rnorm(n)
if(dataonly) return(list(x=z, y=y))
if(method == 'leaps') {
s <- summary(regsubsets(z, y))
best <- which.max(s$adjr2)
xvars <- s$which[best, -1] # remove intercept
ssr <- s$rss[best]
p <- sum(xvars)
xs <- if(p == 0) 'none' else paste((1 : 8)[xvars], collapse='')
if(pr) print(xs)
ssesw <- (n - 1) * var(y) - ssr
s2s <- ssesw / (n - p - 1)
yhat <- if(p == 0) mean(y) else fitted(lm(y ~ z[, xvars]))
}
f <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
if(method == 'stepaic') {
g <- stepAIC(f, trace=0)
p <- g$rank - 1
xs <- if(p == 0) 'none' else
gsub('[ <br>+x]','', as.character(formula(g))[3])
if(pr) print(formula(g), showEnv=FALSE)
ssesw <- sum(resid(g)^2)
s2s <- ssesw/g$df.residual
yhat <- fitted(g)
}
# Set SSEsw / (n - gdf - 1) = true sigma^2
gdf <- n - 1 - ssesw / (sigma^2)
# Compute root mean squared error against true linear predictor
rmse.full <- sqrt(mean((fitted(f) - lp) ^ 2))
rmse.step <- sqrt(mean((yhat - lp) ^ 2))
list(stats=c(n=n, vratio=s2s/(sigma^2),
gdf=gdf, apparentdf=p, rmse.full=rmse.full, rmse.step=rmse.step),
xselected=xs)
}
rsim <- function(B, n, method=c('stepaic', 'leaps')) {
method <- match.arg(method)
xs <- character(B)
r <- matrix(NA, nrow=B, ncol=6)
for(i in 1:B) {
w <- sim(n, method=method)
r[i,] <- w$stats
xs[i] <- w$xselected
}
colnames(r) <- names(w$stats)
s <- apply(r, 2, median)
p <- r[, 'apparentdf']
s['apparentdf'] <- mean(p)
print(round(s, 2))
print(table(p))
cat('Prob[correct model]=', round(sum(xs == '1237')/B, 2), '\n')
}
```
Show the correlation matrix being assumed for the $X$s:
```{r}
uncorrelated <- TRUE
sim(50000, prcor=TRUE)
```
Simulate to find the distribution of the number of variables selected, `r ipacue()`
the proportion of simulations in which the true model ($X_{1}, X_{2},
X_{3}, X_{7}$) was found, the median value of
$\hat{\sigma}^{2}/\sigma^{2}$, the median effective d.f., and the mean
number of apparent d.f., for varying sample sizes.
```{r cache=FALSE}
set.seed(11)
m <- 'leaps' # all possible regressions stopping on R2adj
rsim(100, 20, method=m) # actual model found twice out of 100
rsim(100, 40, method=m)
rsim(100, 150, method=m)
rsim(100, 300, method=m)
rsim(100, 2000)
```
As $n\uparrow$ the mean number of variables selected increased. The
proportion of simulations in which the correct model was found
increased from 0 to 0.53. $\sigma^{2}$ is underestimated when $n=300$ by a factor of 0.42, resulting in the d.f. needed to de-bias
$\hat{\sigma^{2}}$ being greater than $n$ when the apparent d.f. was only 5.16 on
the average. Variable selection slightly increased closeness
to the true $X\beta$.
If the simulations are re-run allowing for collinearities (`uncorrelated=FALSE`) one can expect variable selection to be even more problematic.
Variable selection methods [@har86log]:
`r mrg(sound("mv-5"))`
* Forward selection, backward elimination `r ipacue()`
* Stopping rule: "residual $\chi^{2}$" with d.f. = no.
candidates remaining at current step
* Test for significance or use Akaike's information criterion
(AIC [@atk80]), here $\chi^{2}-2 \times d.f.$
* Better to use subject matter knowledge!
* No currently available stopping rule was developed for
stepwise, only for comparing a limited number of pre-specified
models[@bre92lit, Section 1.3]
* @roe91pre studied forward selection (FS), all `r ipacue()`
possible subsets selection (APS), full fits
* APS more likely to select smaller, less accurate models than FS
* Neither as accurate as full model fit unless more than half of
candidate variables redundant or unnecessary
* Step-down is usually better than forward [@man70why] and can
be used efficiently with maximum likelihood
estimation [@law78]
* Fruitless to try different stepwise methods to look for
agreement [@wie10per]
* Bootstrap can help decide between full and reduced model `r ipacue()`
* Full model fits gives meaningful confidence intervals with
standard formulas, C.I. after stepwise does
not [@alt89; @hur90; @bre92lit]
* Data reduction (grouping variables) can help
* Using the bootstrap to select important variables
for inclusion in the final model [@sau92boo] is
problematic [@aus08boo]
* It is not logical that a population regression coefficient
would be exactly zero just because its estimate was "insignificant"
See also these articles:
* [Step away from stepwise](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-018-0143-6) by Gary Smith `r ipacue()`
* [Five myths about variable selection](https://onlinelibrary.wiley.com/doi/full/10.1111/tri.12895) by Georg Heinze and Daniela Dunkler
* [Variable selection - A review and recommendations for the practicing statistician](https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.201700067) by Georg Heinze, Christine Wallisch, Daniela Dunkler
* [Stopping stepwise](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df) by Peter Flom
### Maxwell's Demon as an Analogy to Variable Selection
`r mrg(sound("mv-5a"))`
Some of the information in the data is spent on variable selection
instead of using all information for estimation.
Model _specification_ is preferred to model _selection_.
Information content of the data usually insufficient for reliable
variable selection.
<img src="maxwellsdemon.jpg" width="65%"> <img src="maxwell.jpg" width="25%">[James Clerk Maxwell]{.aside}
<small>Maxwell imagines one container divided into two parts, A and B. Both parts are filled with the same gas at equal temperatures and placed next to each other. Observing the molecules on both sides, an imaginary demon guards a trapdoor between the two parts. When a faster-than-average molecule from A flies towards the trapdoor, the demon opens it, and the molecule will fly from A to B. Likewise, when a slower-than-average molecule from B flies towards the trapdoor, the demon will let it pass from B to A. The average speed of the molecules in B will have increased while in A they will have slowed down on average. Since average molecular speed corresponds to temperature, the temperature decreases in A and increases in B, contrary to the second law of thermodynamics. </small>
Szilárd pointed out that a real-life Maxwell's demon would need to have some means of measuring molecular speed, and that the act of acquiring information would require an expenditure of energy. Since the demon and the gas are interacting, we must consider the total entropy of the gas and the demon combined. The expenditure of energy by the demon will cause an increase in the entropy of the demon, which will be larger than the lowering of the entropy of the gas.
Source: [commons.wikimedia.org/wiki/File:YoungJamesClerkMaxwell.jpg](commons.wikimedia.org/wiki/File:YoungJamesClerkMaxwell.jpg)<br>
[en.wikipedia.org/wiki/Maxwell's_demon](en.wikipedia.org/wiki/Maxwell's_demon)
Peter Ellis' [blog article](https://ellisp.github.io/blog/2017/03/12/stepwise-timeseries) contains excellent examples of issues
discussed here but applied to time series modeling.
## Overfitting and Limits on Number of Predictors {#sec-multivar-overfit}
`r mrg(sound("mv-6"))`
* Concerned with avoiding overfitting `r ipacue()`
* Assume typical problem in medicine, epidemiology, and the
social sciences in which the signal:noise ratio is small (higher
ratios allow for more aggressive modeling)
* $p$ should be $< \frac{m}{15}$ [@har84; @har85reg; @smi92pro; @ped95imp; @ped96sim; @vit06rel; @plo14mod]
* $p$ is number of parameters in full model or number of _candidate_
parameters in a stepwise analysis
* Derived from simulations to find minimum sample size so that
apparent discrimination = validated discrimination
* Applies to typical signal:noise ratios found outside of tightly
controlled experiments
* If true $R^{2}$ is high, many parameters can be estimated from
smaller samples
* Ignores sample size needed just to estimate the intercept or, in semiparametric models, the underlying distribution function^[The sample size needed for these is model-dependent]
* @ril19min and @ril19mina have refined sample size
estimation for continuous, binary, and time-to-event models to
account for all of these issues
* To just estimate $\sigma$ in a linear model with a multiplicative margin of error of 1.2 with 0.95 confidence requires $n=70$
<!-- NEW changed eff sample size for binary -->
`r fa <- '^[If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is $3 n_{1}n_{2}/n \\approx 3 \\min(n_{1}, n_{2})$ if $\\frac{n_{1}}{n}$ is near 0 or 1 [@whi93sam, Eq. 10, 15]. Here $n_{1}$ and $n_{2}$ are the marginal frequencies of the two response levels [@ped96sim]. The effective sample size is a maximum ($0.75n$) when $n_{1}=n_{2}$, i.e. $p=\\frac{1}{2}$ .]'`
`r fb <- '^[Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are $n_{1}, \\ldots, n_{k}$, compared with all cell sizes equal to unity (response is continuous) [@whi93sam, Eq. 3]. If all cell sizes are equal, the relative efficiency of having $k$ response categories compared to a continuous response is $1 - \\frac{1}{k^{2}}$ [@whi93sam, Eq. 14], e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.]'`
`r fc <- '^[This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional hazards methods such as Wilcoxon-type tests [@ben82eff].]'`
`r ipacue()`
| Type of Response Variable | Limiting Sample Size $m$ |
|-----|-----|
| Continuous | $n$ (total sample size) |
| Binary | $3np(1-p), p=\frac{n_{2}}{n}$ `r fa` |
| Ordinal ($k$ categories) | $n - \frac{1}{n^{2}} \sum_{i=1}^{k} n_{i}^{3}$ `r fb` |
| Failure (survival) time | number of failures `r fc` |
: Limiting Sample Sizes for Various Response Variable Types
* Narrowly distributed predictor $\rightarrow$ even higher $n$ `r ipacue()`
* $p$ includes _all_ variables screened for association
with response, including interactions
* Univariable screening (graphs, crosstabs, etc.) **in no way**
reduces multiple comparison problems of model building [@sun96ina]
<!-- NEW -->
::: {.callout-note collapse="true"}
## More About the Binary $Y$ Case
To derive the effective sample size for binary $Y$ we compare two samples of equal size $\frac{n}{2}$ by testing whether the log odds ratio is zero. When the two samples come from populations with equal $\Pr(Y=1) = p$, the variance of the log odds ratio is approximately $2 \times \frac{1}{\frac{n}{2}p(1-p)} = \frac{4}{np(1-p)}$. When $Y$ is continuous with no ties and a Wilcoxon or proportional odds model test is used, the reciprocal of the variance of the proportional odds model's log odds ratio is $\frac{n^{3}}{12 (n+1)^{2}} \times (1 - \frac{1}{n^{2}}) \approx \frac{n}{12}$ so the variance is approximately $\frac{12}{n}$. The ratio of this to the binary $Y$ variance is $3p(1-p)$.
:::
## Shrinkage {#sec-multivar-est-use-shrink}
`r mrg(movie("https://youtu.be/yWoKLDy8IQA"), sound("mv-7"))`
* Slope of calibration plot; regression to the mean `r ipacue()`
* Statistical estimation procedure --- "pre-shrunk" models
* Aren't regression coefficients OK because they're unbiased?
* Problem is in how we use coefficient estimates
* Consider 20 samples of size $n=50$ from $U(0,1)$
* Compute group means and plot in ascending order
* Equivalent to fitting an intercept and 19 dummies using least
squares
* Result generalizes to general problems in plotting $Y$ vs.\
$X\hat{\beta}$
```{r ps=7,cap='Sorted means from 20 samples of size 50 from a uniform $[0,1]$ distribution. The reference line at 0.5 depicts the true population value of all of the means.',scap='Means from 20 $U(0,1)$ samples'}
#| label: fig-multivar-shrink-groupmeans
set.seed(123)
n <- 50
y <- runif(20*n)
group <- rep(1:20,each=n)
ybar <- tapply(y, group, mean)
ybar <- sort(ybar)
plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3,.7),
xlab='Group', ylab='Group Mean')
lines(1:20, ybar)
points(1:20, ybar, pch=20, cex=.5)
axis(2)
axis(1, at=1:20, labels=FALSE)
for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])
```
* Prevent shrinkage by using pre-shrinkage `r ipacue()`
* @spi86: var. selection arbitrary, better
prediction usually results from fitting all candidate
variables and using shrinkage
* Shrinkage closer to that expected from full model fit than based on
number of significant variables [@cop83reg]
* Ridge regression [@ces92rid; @hou90pre]
* Penalized MLE [@ver94pen; @gra92fle; @har98dev]
* Heuristic shrinkage parameter of van Houwelingen and le `r ipacue()`
Cessie [@hou90pre, Eq. 77]
$$\hat{\gamma} = \frac{{\rm model}\ \chi^{2} - p}{{\rm model}\ \chi^{2}},$$ {#eq-heuristic-shrink}
* OLS^[An excellent discussion about such indexes may be found [here](https://stats.stackexchange.com/questions/48703).]:
$\hat{\gamma} = \frac{n-p-1}{n-1} R^{2}_{\rm adj} / R^{2}$ <br>
$R^{2}_{\rm adj} = 1 - (1 - R^{2})\frac{n-1}{n-p-1}$
* $p$ close to no. candidate variables
* @cop83reg (Eq. 8.5) adds 2 to numerator
## Collinearity {#sec-multivar-collinear}
`r mrg(sound("mv-8"))`
* When at least 1 predictor can be predicted well from others `r ipacue()`
* Can be a blessing (data reduction, transformations)
* $\uparrow$ s.e. of $\hat{\beta}$, $\downarrow$ power
* This is appropriate $\rightarrow$ asking too much of the
data [@chaBookreg, Chap. 9]
* Variables compete in variable selection, chosen one
arbitrary
* Does not affect joint influence of a set of highly correlated
variables (use multiple d.f. tests)
* Does not at all affect predictions on model construction
sample `r ipacue()`
* Does not affect predictions on new data [@mye90, pp. 379-381] if
+ Extreme extrapolation not attempted
+ New data have same type of collinearities as original data
* Example: LDL and total cholesterol -- problem only if more inconsistent
in new data
* Example: age and age$^{2}$ -- no problem
* One way to quantify for each predictor:
variance inflation factors (VIF)
* General approach (maximum likelihood) --- transform information
matrix to correlation form, VIF=diagonal of
inverse [@dav86; @wax92col]
* See @bel91con, pp. 28-30 for problems with VIF
* Easy approach: SAS `VARCLUS` procedure [@varclus], `r ipacue()`
`R` `varclus` function,
other clustering techniques: group highly correlated variables
* Can score each group (e.g., first principal component,
$PC_1$ [@dag95dev]); summary scores not collinear
## Data Reduction {#sec-multivar-data-reduction}
`r mrg(sound("mv-9"))`
* Unless $n>>p$, model unlikely to validate `r ipacue()`
* Data reduction: $\downarrow p$
* Use the literature to eliminate unimportant variables.
* Eliminate variables whose distributions are too narrow.
* Eliminate candidate predictors that are missing in a large
number of subjects, especially if those same predictors are likely
to be missing for future applications of the model.
* Use a statistical data reduction method such as incomplete
principal components regression, nonlinear
generalizations of principal components such as principal surfaces,
sliced inverse regression, variable clustering,
or ordinary cluster analysis on a measure of similarity
between variables.
* Data reduction is **completely masked to $Y$**, which is precisely why it does not distort estimates, standard errors, $P$-values, or confidence limits
* Data reduction = _unsupervised learning_
* Example: dataset with 40 events and 60 candidate predictors
+ Use variable clustering to group variables by correlation structure
+ Use clinical knowledge to refine the clusters
+ Keep age and severity of disease as separate predictors because of their strength
+ For others create clusters: socioeconomic, risk factors/history, and physiologic function
+ Summarize each cluster with its first principal component $PC_{1}$, i.e., the linear combination of characteristics that maximizes variance of the score across subjects subject to an overall constraint on the coefficients
+ Fit outcome model with 5 predictors
* See [this](https://stats.stackexchange.com/questions/5774) for useful resources related to principal component analysis of mixed continuous and categorical variables
* See [this](https://stats.stackexchange.com/questions/239898) for an excellent discussion of whether data reduction (unsupervised learning) steps need to be included in resampling for model validation
### Redundancy Analysis {#sec-multivar-redun}
* Remove variables that have poor distributions `r ipacue()`
+ E.g., categorical variables with fewer than 2 categories having
at least 20 observations
* Use flexible additive parametric additive models to
determine how well each variable can be predicted from the remaining
variables
* Variables dropped in stepwise fashion, removing the
most predictable variable at each step
* Remaining variables used to predict
* Process continues until no variable still in the list
of predictors can be predicted with an $R^2$ or adjusted $R^2$ greater
than a specified threshold or until dropping the variable with the highest
$R^2$ (adjusted or ordinary) would cause a variable that was dropped
earlier to no longer be predicted at the threshold from
the now smaller list of predictors
* `R` function `redun` in `Hmisc` package
* Related to _principal variables_ [@mcc84pri] but faster
### Variable Clustering
* Goal: Separate variables into groups `r ipacue()`
+ variables within group correlated with each other
+ variables not correlated with non-group members
* Score each dimension, stop trying to separate effects of
factors measuring same phenomenon
* Variable clustering [@varclus; @dag95dev] (oblique-rotation PC analysis) $\rightarrow$
separate variables so that first PC is representative of group
* Can also do hierarchical cluster analysis on similarity matrix
based on squared Spearman or Pearson correlations, or more
generally, Hoeffding's $D$ [@hoe48non].
* See @guo11pri for a method related to variable clustering
and sparse principal components.
* @cha12clu implement many more variable clustering methods
Example: @fig-cony-redun
### Transformation and Scaling Variables Without Using $Y$
`r mrg(sound("mv-9a"))`
* Reduce $p$ by estimating transformations using associations `r ipacue()`
with other predictors
* Purely categorical predictors -- correspondence
analysis [@lec88cor; @cri89cor; @cia86str; @gre88cor; @mic98gif]
* Mixture of qualitative and continuous variables: qualitative
principal components
* Maximum total variance (MTV) of Young, Takane, de
Leeuw [@you78; @mic98gif]
1. Compute $PC_1$ of variables using correlation matrix
1. Use regression (with splines, dummies, etc.) to predict
$PC_1$ from each $X$ --- expand each $X_j$ and regress
it separately on $PC_1$ to get working transformations
1. Recompute $PC_1$ on transformed $X$s
1. Repeat 3-4 times until variation explained by $PC_1$
plateaus and transformations stabilize
* Maximum generalized variance (MGV) method of
Sarle [@prinqual, pp. 1267-1268]
1. Predict each variable from (current transformations of)
all other variables
1. For each variable, expand it into linear and nonlinear
terms or dummies, compute first canonical variate
1. For example, if there are only two variables $X_{1}$ and
$X_{2}$ represented as quadratic polynomials, solve for
$a,b,c,d$ such that
$aX_{1}+bX_{1}^{2}$ has maximum correlation with
$cX_{2}+dX_{2}^{2}$.
1. Goal is to transform each var. so that it is most similar
to predictions from other transformed variables
1. Does not rely on PCs or variable clustering
* MTV (PC-based instead of canonical var.)
and MGV implemented in SAS `PROC PRINQUAL` [@prinqual]
1. Allows flexible transformations including monotonic splines
1. Does not allow restricted cubic splines, so may be unstable unless
monotonicity assumed
1. Allows simultaneous imputation but often yields wild estimates
### Simultaneous Transformation and Imputation {#sec-multivar-trans-impute}
`R` `transcan` Function for Data Reduction & Imputation
* Initialize missings to medians (or most frequent category) `r ipacue()`
* Initialize transformations to original variables
* Take each variable in turn as $Y$
* Exclude obs. missing on $Y$
* Expand $Y$ (spline or dummy variables)
* Score (transform $Y$) using first canonical variate
* Missing $Y$ $\rightarrow$ predict canonical variate from $X$s
* The imputed values can optionally be shrunk to avoid
overfitting for small $n$ or large $p$
* Constrain imputed values to be in range of non-imputed ones
* Imputations on original scale
+ Continuous $\rightarrow$ back-solve with linear interpolation
+ Categorical $\rightarrow$ classification tree (most freq. cat.) or
match to category whose canonical score is closest
to one predicted
* Multiple imputation --- bootstrap or approx. Bayesian bootstrap.
+ Sample residuals multiple times (default $M=5$)
+ Are on "optimally" transformed scale
+ Back-transform
+ `fit.mult.impute` works with `aregImpute` and
`transcan` output to easily get imputation-corrected variances and
avg. $\hat{\beta}$
* Option to insert constants as imputed values (ignored during
transformation estimation); helpful when a lab value may
be missing because the patient returned to normal
* Imputations and transformed values may be easily
obtained for new data
* An `R` function `Function` will create a series of
`R` functions that transform each predictor
* Example: $n=415$ acutely ill patients
+ Relate heart rate to mean arterial blood pressure
+ Two blood pressures missing
+ Heart rate not monotonically related to blood pressure
+ See @fig-multivar-physiol-transcan and @fig-multivar-physiol-transcan2
<!-- NEW replaced plot with ggplot added trantab=TRUE -->
```{r cap='Transformations fitted using `transcan`. Tick marks indicate the two imputed values for blood pressure.',scap='`transcan` transformations for two physiologic variables'}
#| label: fig-multivar-physiol-transcan
#| fig-height: 2.75
require(Hmisc)
getHdata(support) # Get data frame from web site
heart.rate <- support$hrt
blood.pressure <- support$meanbp
blood.pressure[400:401]
blood.pressure[400:401] <- NA # Create two missings
d <- data.frame(heart.rate, blood.pressure)
w <- transcan(~ heart.rate + blood.pressure, transformed=TRUE,
imputed=TRUE, show.na=TRUE, data=d, trantab=TRUE, pl=FALSE)
ggplot(w)
w$imputed$blood.pressure
t <- w$transformed
spe <- round(c(spearman(heart.rate, blood.pressure),
spearman(t[,'heart.rate'],
t[,'blood.pressure'])), 2)
```
<!-- ??? TODO
#| fig-cap: !expr paste('Relationship between heart rate and blood pressure. The lower left plot contains raw data (Spearman $\\rho=',spe[1],'$); the lower right is a scatterplot of the corresponding transformed values ($\\rho=',spe[2],'$). Data courtesy of the SUPPORT study [@kna95sup].')
-->
`r this <- 'and that'`
```{r}
#| label: fig-multivar-physiol-transcan2
#| fig-cap: "Relationship between heart rate and blood pressure: untransformed data (left) and transformed data (right). Data courtesy of the SUPPORT study [@kna95sup]."
#| fig-scap: 'HR vs. BP before and after `transcan` transformations'
#| fig-height: 3
spar(mfrow=c(1,2))
plot(heart.rate, blood.pressure)
plot(t[,'heart.rate'], t[,'blood.pressure'],
xlab='Transformed hr', ylab='Transformed bp')
```
ACE (Alternating Conditional Expectation) of @bre85est
1. Uses nonparametric "super smoother" [@fri84] `r ipacue()`
1. Allows monotonicity constraints, categorical vars.
1. Does not handle missing data
* These methods find _marginal_ transformations `r ipacue()`
* Check adequacy of transformations using $Y$
+ Graphical
+ Nonparametric smoothers ($X$ vs. $Y$)
+ Expand original variable using spline, test additional predictive information over original transformation
### Simple Scoring of Variable Clusters
`r mrg(sound("mv-10"))`
* Try to score groups of transformed variables with $PC_1$ `r ipacue()`
* Reduces d.f. by pre-transforming var. and by combining
multiple var.
* Later may want to break group apart, but delete all variables
in groups whose summary scores do not add significant information
* Sometimes simplify cluster score by finding a subset of its
constituent variables which predict it with high $R^2$.
Series of dichotomous variables:
* Construct $X_1$ = 0-1 according to whether any variables positive `r ipacue()`
* Construct $X_2$ = number of positives
* Test whether original variables add to $X_1$ or $X_2$
### Simplifying Cluster Scores
### How Much Data Reduction Is Necessary?
Using Expected Shrinkage to Guide Data Reduction
* Fit full model with all candidates, $p$ d.f., LR likelihood ratio `r ipacue()`
$\chi^2$
* Compute $\hat{\gamma}$
* If $<0.9$, consider shrunken estimator from whole model,
or data reduction (again not using $Y$)
* $q$ regression d.f. for reduced model
* Assume best case: discarded dimensions had no association with $Y$
* Expected loss in LR is $p-q$
* New shrinkage $[{\rm LR} - (p-q) - q]/[{\rm LR} - (p-q)]$
* Solve for $q$ $\rightarrow$ $q \leq ({\rm LR}-p)/9$
* Under these assumptions, no hope unless original LR $> p+9$
* No $\chi^2$ lost by dimension reduction $\rightarrow$ $q \leq {\rm LR}/10$
Example:
`r mrg(sound("mv-11"))`
* Binary logistic model, 45 events on 150 subjects `r ipacue()`
* 10:1 rule $\rightarrow$ analyze 4.5 d.f. total
* Analyst wishes to include age, sex, 10 others
* Not known if age linear or if age and sex additive
* 4 knots $\rightarrow$ $3+1+1$ d.f. for age and sex if
restrict interaction to be linear
* Full model with 15 d.f. has LR=50
* Expected shrinkage factor $(50-15)/50 = 0.7$
* LR$>15+9=24$ $\rightarrow$ reduction may help
* Reduction to $q = (50-15)/9 \approx 4$ d.f. necessary
* Have to assume age linear, reduce other 10 to 1 d.f.
* Separate hypothesis tests intended $\rightarrow$ use full model,
adjust for multiple comparisons
| Goals | Reasons | Methods |
|-------|---------|---------|
| Group predictors so that each group represents a single dimension that can be summarized with a single score | • $\downarrow$ d.f. arising from multiple predictors<br>• Make $PC_1$ more reasonable summary | Variable clustering<br><br>• Subject matter knowledge<br>• Group predictors to maximize proportion of variance explained by $PC_1$ of each group<br>• Hierarchical clustering using a matrix of similarity measures between predictors |
| Transform predictors | • $\downarrow$ d.f.\ due to nonlinear and dummy variable components<br>• Allows predictors to be optimally combined<br>• Make $PC_1$ more reasonable summary<br>• Use in customized model for imputing missing values on each predictor | • Maximum total variance on a group of related predictors<br>• Canonical variates on the total set of predictors |
: Summary of Some Data Reduction Methods
## Other Approaches to Predictive Modeling
## Overly Influential Observations {#sec-multivar-influence}
`r mrg(sound("mv-12"))`
* Every observation should influence fit `r ipacue()`
* Major results should not rest on 1 or 2 obs.
* Overly infl. obs. $\rightarrow$ $\uparrow$ variance of predictions
* Also affects variable selection
Reasons for influence:
* Too few observations for complexity of model (see @sec-multivar-data-reduction, @sec-multivar-variable-selection) `r ipacue()`
* Data transcription or entry errors
* Extreme values of a predictor
1. Sometimes subject so atypical should remove from dataset
1. Sometimes truncate measurements where data density ends
1. Example: $n=4000$, 2000 deaths, white blood count
range 500-100,000, .05,.95 quantiles=2755, 26700
1. Linear spline function fit
1. Sensitive to WBC$>60000\ (n=16)$
1. Predictions stable if truncate WBC to 40000 ($n=46$ above
40000)
* Disagreements between predictors and response. Ignore unless
extreme values or another explanation
* Example: $n=8000$, one extreme predictor value not on straight
line relationship with other $(X,Y)$ $\rightarrow$ $\chi^{2}=36$ for
$H_{0}:$ linearity
Statistical Measures:
* Leverage: capacity to be influential (not necessarily infl.) <br> `r ipacue()`
Diagonals of "hat matrix" $H = X(X'X)^{-1}X'$ --- measures how
an obs. predicts its own response [@bel80]
* $h_{ii}>2(p+1)/n$ may signal a high leverage
point [@bel80]
* DFBETAS: change in $\hat{\beta}$ upon deletion of each obs,
scaled by s.e.
* DFFIT: change in $X\hat{\beta}$ upon deletion of each obs
* DFFITS: DFFIT standardized by s.e. of $\hat{\beta}$
* Some classify obs as overly influential when
$|{\rm DFFITS}|>2\sqrt{(p+1)/(n-p-1)}$ [@bel80]
* Others examine entire distribution for "outliers"
* No substitute for careful examination of data [@cha91; @spe93rem]
* Maximum likelihood estimation requires 1-step approximations
## Comparing Two Models
`r mrg(sound("mv-13"))`
* Level playing field (independent datasets, same no. candidate `r ipacue()`
d.f., careful bootstrapping)
* Criteria:
1. calibration
1. discrimination
1. face validity
1. measurement errors in required predictors
1. use of continuous predictors (which are usually better defined
than categorical ones)
1. omission of "insignificant" variables that nonetheless make
sense as risk factors
1. simplicity (though this is less important with the
availability of computers)
1. lack of fit for specific types of subjects
* Goal is to rank-order: ignore calibration `r ipacue()`
* Otherwise, dismiss a model having poor calibration
* Good calibration $\rightarrow$ compare discrimination
(e.g., $R^{2}$ [@nag91not], model $\chi^2$,
Somers' $D_{xy}$, Spearman's $\rho$, area under ROC curve)
* Worthwhile to compare models on a measure not used to optimize `r ipacue()`
either model, e.g., mean absolute error, median absolute error if
using OLS
* Rank measures may not give enough credit to extreme predictions $\rightarrow$
model $\chi^{2}, R^{2}$, examine extremes of distribution of $\hat{Y}$
* Examine differences in predicted values from the two models
* See [@pee07ext; @pen08eva; @pen11ext; @pen12nov] for discussions
and examples of low power for testing differences in ROC areas, and
for other approaches.
## Improving the Practice of Multivariable Prediction
`r mrg(sound("mv-14"))`
See also @sec-val-habits
@gre00whe discusses many important points:
* Stepwise variable selection on confounders leaves important `r ipacue()`
confounders uncontrolled
* Shrinkage is far superior to variable selection
* Variable selection does more damage to confidence interval
widths than to point estimates
* Claims about unbiasedness of ordinary MLEs are misleading
because they assume the model is correct and is the only model
entertained
* "models need to be complex to capture uncertainty about the
relations ... an honest uncertainty assessment requires parameters
for all effects that we know may be present. This advice is
implicit in an anti-parsimony principle often attributed to
L. J. Savage 'All models should be as big as an elephant' (see Draper, 1995)"
Greenland's example of inadequate adjustment for confounders as a
result of using a bad modeling strategy:
* Case-control study of diet, food constituents, breast cancer `r ipacue()`
* 140 cases, 222 controls
* 35 food constituent intakes and 5 confounders
* Food intakes are correlated
* Traditional stepwise analysis not adjusting simultaneously
for all foods consumed $\rightarrow$ 11 foods had $P < 0.05$
* Full model with all 35 foods competing $\rightarrow$ 2 had $P < 0.05$
* Rigorous simultaneous analysis (hierarchical random slopes
model) penalizing estimates for the number of associations
examined $\rightarrow$ no foods associated with breast cancer
**Global Strategies**
* Use a method known not to work well (e.g., stepwise variable `r ipacue()`
selection without penalization; recursive partitioning), document
how poorly the model
performs (e.g. using the bootstrap), and use the model anyway
* Develop a black box model that performs poorly and is difficult
to interpret (e.g., does not incorporate penalization)
* Develop a black box model that performs well and is difficult to interpret
* Develop interpretable approximations to the black box
* Develop an interpretable model (e.g. give priority to additive
effects) that performs well and is likely to perform equally well
on future data from the same stream
**Preferred Strategy in a Nutshell**
* Decide how many d.f. can be spent `r ipacue()`
* Decide where to spend them
* Spend them
* Don't reconsider, especially if inference needed
## Summary: Possible Modeling Strategies {#sec-multivar-strategy}
### Developing Predictive Models
`r mrg(sound("mv-15"))`
1. Assemble accurate, pertinent data and lots of it, with wide `r ipacue()`
distributions for $X$.
1. Formulate good hypotheses --- specify
relevant candidate predictors and possible interactions. Don't
use $Y$ to decide which $X$'s to include.
1. Characterize subjects with missing $Y$. Delete such subjects
in rare circumstances [@cra95com]. For certain models it
is effective to multiply impute $Y$.
1. Characterize and impute missing $X$. In most cases use
multiple imputation based on $X$ and $Y$
1. For each predictor specify complexity or degree of
nonlinearity that should be allowed (more for important
predictors or for large $n$) (@sec-multivar-complexity)
1. Do data reduction if needed (pre-transformations, combinations),
or use penalized estimation [@har98dev]
1. Use the entire sample in model development
1. Can do highly structured testing to simplify "initial" model
+ Test entire group of predictors with a single $P$-value
+ Make each continuous predictor have same number of knots,
and select the number that optimizes AIC
+ Test the combined effects of all nonlinear terms with a
single $P$-value
1. Make tests of linearity of effects in the model only to
demonstrate to others that such effects are often statistically
significant. Don't remove individual insignificant effects from the model.
1. Check additivity assumptions by testing pre-specified
interaction terms. Use a global test and either keep all or
delete all interactions.
1. Check to see if there are overly-influential observations.
1. Check distributional assumptions and choose a different model
if needed.
1. Do limited backwards step-down variable selection if parsimony
is more important that accuracy [@spi86]. But confidence
limits, etc., must account for variable selection (e.g., bootstrap).
1. This is the "final" model.
1. Interpret the model graphically and by computing predicted
values and appropriate test statistics. Compute pooled tests of
association for collinear predictors.
1. Validate this model for calibration and discrimination ability,
preferably using bootstrapping.
1. Shrink parameter estimates if there is overfitting but no
further data reduction is desired (unless shrinkage built-in to
estimation)
1. When missing values were imputed, adjust final
variance-covariance matrix
for imputation. Do this as early as possible because
it will affect other findings.
1. When all steps of the modeling strategy can be automated,
consider using @far92cos to penalize for
the randomness inherent in the multiple steps.
1. Develop simplifications to the final model as needed.
<!-- NEW confounding 2023-05-30 -->
### Developing Models for Effect Estimation
`r mrg(sound("mv-16"))`
1. Less need for parsimony; even less need to remove insignificant `r ipacue()`
variables from model (otherwise CLs too narrow)
1. Careful consideration of interactions; inclusion forces
estimates to be conditional and raises variances
1. If variable of interest is mostly the one that is missing,
multiple imputation less valuable
1. Complexity of main variable specified by prior beliefs,
compromise between variance and bias
1. Don't penalize terms for variable of interest
1. Model validation less necessary
1. Like the hypothesis testing scenario covered next, when one is attempting to apply a causal interpretation to the effect being estimated, consideration of pathways and confounding are all-important. Collider bias and backdoor pathways can ruin interpretation, and so can the omission of confounder variables that would have provided alternate explanations for observed effects of interest. In pure prediction mode we don't care very much how we got here, but in isolating the effect of a specific variable in the model, e.g., in estimating the effect of treatment in an observational treatment comparison study, measuring and adjusting for nearly all confounding in play are very important. Sometimes adjusting for confounding requires overfitting the model from a prediction standpoint, but that is OK. Overfitting can result in widering confidence intervals for the effect of interest, and that is a proper price to pay.
### Developing Models for Hypothesis Testing
1. Virtually same as previous strategy `r ipacue()`
1. Interactions require tests of effect by varying values of
another variable, or "main effect + interaction" joint tests
(e.g., is treatment effective for either sex, allowing effects to be
different)
1. Validation may help quantify over-adjustment
## Study Questions
**Section 4.1**
1. An investigator is interested in finding risk factors for stroke.
She has collected data on 3000 subjects of whom 70 were known to
suffer a stroke within 5 years from study entry. She has decided
that race (5 possible levels), age, height, and mean arterial blood
pressure should definitely be in the model. She does not try to
find previous studies that show how these potential risk factors
relate to the risk of stroke. What is one way to decide how many
degrees of freedom to spend on each predictor? How can one achieve
these d.f. (whatever they are) for race and for a continuous variable?
1. What is the driving force behind the approach used here to prespecify the complexity with which each predictor is specified in the regression model?
1. Why is the process fair?
1. When feasible to do, why is learning from a saturated model preferred?
**Section 4.3**
1. What did we already study that variable selection is an extension of?
1. What does variable selection ruin more than it ruins regression coefficient estimates?
1. Besides those problems what's the worst thing you can say about statistical test-based variable selection?
1. In which situation would variable selection have a chance to be helpful?
1. What is the statistical point that the Maxwell demon analogy is trying to bring out?
1. What is the single root cause of problems that is in common to each of the
following strategies?
+ The investigator retains race as 5 categories and fits it with 4
dummy variables. She examines coefficients from these dummy
variables to decide which races have similar coefficients and should
be pooled.
+ All continuous variables are allowed to be nonlinear using
splines, and variables having insignificant nonlinear terms are
re-fitted as linear.
+ All potential risk factors are used as candidate predictor
variables with forward stepwise variable selection with a
significance level for entering the model of $\alpha=0.1$.
+ A clinical trial of 5 treatments find that 3 of the treatments
result in nearly the same blood pressure reduction. These 3
treatments are pooled and their overall mean is compared with each
of the remaining 2 treatments.
1. Name at least one specific harm done by the strategies outlined in the
last question, in general terms (not specific to each strategy).
**Section 4.4**
1. Describe a major way that "breaking ties" in Y, i.e., having it more continuous, results in more power and precision in modeling.
1. What is the effective sample size in a study with 10 cancer outcomes and 100,000 normal outcomes?
**Section 4.5**
1. What are main causes of passive shrinkage/regression to the mean?
1. Why does the heuristic shrinkage estimate work?
1. What causes the graph of $\hat{Y}$ (on $x$-axis) vs.\ $Y$
(on $y$-axis; you could call this $Y_{new}$) to be flatter than a
$45^\circ$ line when the graph is made for new data not used to fit
the model?
**Section 4.6**
1. What causes the standard error of a regression coefficient to get large when adjusting for a competing predictor?
1. When just wanting to measure strength of association or test a group of predictors, what is a simple fix for co-linearity?
**Section 4.7**
1. What is the purpose of data reduction, and why does it not create
phantom degrees of freedom that need to be accounted for later in
the modeling process?
1. Name several data reduction methods for reducing the number of
regression coefficients to estimate.
1. Principal components can have unstable loadings (e.g., if you bootstrap). Why are they still a valuable component in a data reduction strategy?
1. What is the overall guidance for knowing how much unsupervised learning (data reduction) to do?
**Section 4.9**
1. Why are high leverage (far out) observations influential on regression coefficients even when Y is binary?
**Section 4.10**
1. What would make one model having much better predictive discrimination than another irrelevant?
**Section 4.12**
1. Are the steps to the default strategy complete and is the sequencing "correct"?
```{r echo=FALSE}
saveCap('04')
```
```