4  Multivariable Modeling Strategies


There are many choices to be made when deciding upon a global modeling strategy, including choice between


4.1 Prespecification of Predictor Complexity Without Later Simplification

  • Rarely expect linearity
  • Can’t always use graphs or other devices to choose transformation
  • If select from among many transformations, results biased
  • Need to allow flexible nonlinearity to potentially strong predictors not known to predict linearly
  • Once decide a predictor is “in” can choose no. of parameters to devote to it using a general association index with \(Y\)
  • Need a measure of “potential predictive punch”
  • Measure needs to mask analyst to true form of regression to preserve statistical properties

Motivating examples:

# Overfitting a flat relationship
x <- runif(1000)
y <- runif(1000, -0.5, 0.5)
dd <- datadist(x, y); options(datadist='dd')
par(mfrow=c(2,2), mar=c(2, 2, 3, 0.5))
pp <- function(actual) {
  yhat  <- predict(f, data.frame(x=xs))
  yreal <- actual(xs)
  plot(0, 0, xlim=c(0,1),
       ylim=range(c(quantile(y, c(0.1, 0.9)), yhat,
       type='n', axes=FALSE)
  axis(1, labels=FALSE); axis(2, labels=FALSE)
  lines(xs, yreal)
  lines(xs, yhat, col='blue')
f <- ols(y ~ rcs(x, 5))
xs <- seq(0, 1, length=150)
pp(function(x) 0*x)
title('Mild Error:\nOverfitting a Flat Relationship',
y <- x + runif(1000, -0.5, 0.5)
f <- ols(y ~ rcs(x, 5))
pp(function(x) x)
title('Mild Error:\nOverfitting a Linear Relationship',
y <- x^4 + runif(1000, -1, 1)
f <- ols(y ~ x)
pp(function(x) x^4)
title('Serious Error:\nUnderfitting a Steep Relationship',
y <- - (x - 0.5) ^ 2 + runif(1000, -0.2, 0.2)
f <- ols(y ~ x)
pp(function(x) - (x - 0.5) ^ 2)
title('Tragic Error:\nMonotonic Fit to\nNon-Monotonic Relationship',

Figure 4.1: Fitting errors to withstand or to avoid

Examples of Reducing the Number of Parameters
Categorical predictor with \(k\) levels Collapse less frequent categories into “other”
Continuous predictor represented as \(k\)-knot restricted cubic spline Reduce \(k\) to a number as low as 3, or 0 (linear)


4.1.1 Learning From a Saturated Model

When the effective sample size available is sufficiently large so that a saturated main effects model may be fitted, a good approach to gauging predictive potential is the following.

  • Let all continuous predictors be represented as restricted cubic splines with \(k\) knots, where \(k\) is the maximum number of knots the analyst entertains for the current problem.
  • Let all categorical predictors retain their original categories except for pooling of very low prevalence categories (e.g., ones containing \(<6\) observations).
  • Fit this general main effects model.
  • Compute the partial \(\chi^2\) statistic for testing the association of each predictor with the response, adjusted for all other predictors. In the case of ordinary regression convert partial \(F\) statistics to \(\chi^2\) statistics or partial \(R^2\) values.
  • Make corrections for chance associations to “level the playing field” for predictors having greatly varying d.f., e.g., subtract the d.f. from the partial \(\chi^2\) (the expected value of \(\chi^{2}_{p}\) is \(p\) under \(H_{0}\)).
  • Make certain that tests of nonlinearity are not revealed as this would bias the analyst.
  • Sort the partial association statistics in descending order.

Commands in the rms package can be used to plot only what is needed. Here is an example for a logistic model.

f <- lrm(y ~ sex + race + rcs(age,5) + rcs(weight,5) +
         rcs(height,5) + rcs(blood.pressure,5))

4.1.2 Using Marginal Generalized Rank Correlations

When collinearities or confounding are not problematic, a quicker approach based on pairwise measures of association can be useful. This approach will not have numerical problems (e.g., singular covariance matrix) and is based on:

  • 2 d.f. generalization of Spearman \(\rho\)\(R^2\) based on \(rank(X)\) and \(rank(X)^2\) vs. \(rank(Y)\)
  • \(\rho^2\) can detect U-shaped relationships
  • For categorical \(X\), \(\rho^2\) is \(R^2\) from dummy variables regressed against \(rank(Y)\); this is tightly related to the Wilcoxon-Mann-Whitney-Kruskal-Wallis rank test for group differences1
  • Sort variables by descending order of \(\rho^2\)
  • Specify number of knots for continuous \(X\), combine infrequent categories of categorical \(X\) based on \(\rho^2\)
  • 1 This test statistic does not inform the analyst of which groups are different from one another.

  • Allocating d.f. based on partial tests of association or sorting \(\rho^2\) is a fair procedure because

    • We already decided to keep variable in model no matter what \(\rho^2\) or \(\chi^2\) values are seen
    • \(\rho^2\) and \(\chi^2\) do not reveal degree of nonlinearity; high value may be due solely to strong linear effect
    • low \(\rho^2\) or \(\chi^2\) for a categorical variable might lead to collapsing the most disparate categories

    Initial simulations show the procedure to be conservative. Note that one can move from simpler to more complex models but not the other way round

    4.2 Checking Assumptions of Multiple Predictors Simultaneously

    • Sometimes failure to adjust for other variables gives wrong transformation of an \(X\), or wrong significance of interactions
    • Sometimes unwieldy to deal simultaneously with all predictors at each stage \(\rightarrow\) assess regression assumptions separately for each predictor

    4.3 Variable Selection

    • Series of potential predictors with no prior knowledge
    • \(\uparrow\) exploration \(\rightarrow\) \(\uparrow\) shrinkage (overfitting)
    • Summary of problem: \(E(\hat{\beta} | \hat{\beta}\) “significant” \() \neq \beta\) (Chatfield, 1995)
    • Biased \(R^2\), \(\hat{\beta}\), standard errors, \(P\)-values too small
    • \(F\) and \(\chi^2\) statistics do not have the claimed distribution2 (Grambsch & O’Brien, 1991)
    • Will result in residual confounding if use variable selection to find confounders (Greenland, 2000)
    • Derksen & Keselman (1992) found that in stepwise analyses the final model represented noise 0.20-0.74 of time, final model usually contained \(< \frac{1}{2}\) actual number of authentic predictors. Also:
  • 2 Lockhart et al. (2013) provide an example with \(n=100\) and 10 orthogonal predictors where all true \(\beta\)s are zero. The test statistic for the first variable to enter has type I assertion probability of 0.39 when the nominal \(\alpha\) is set to 0.05.

  • K
    1. The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model.
    2. The number of candidate predictor variables affected the number of noise variables that gained entry to the model.
    3. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
    4. The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model.
    • Global test with \(p\) d.f. insignificant \(\rightarrow\) stop

    Simulation experiment, true \(\sigma^{2} = 6.25\), 8 candidate variables, 4 of them related to \(Y\) in the population. Select best model using all possible subsets regression to maximize \(R^{2}_{adj}\) (all possible subsets is not usually recommended but gives variable selection more of a chance to work in this context).


    Note: The audio was made using stepAIC with collinearities in predictors. The code below allows for several options. Here we use all possible subsets of predictors and force predictors to be uncorrelated, which is the easiest case for variable selection.

    sim <- function(n, sigma=2.5, method=c('stepaic', 'leaps'),
                    pr=FALSE, prcor=FALSE, dataonly=FALSE) {
      method <- match.arg(method)
      if(uncorrelated) {
        x1 <- rnorm(n)
        x2 <- rnorm(n)
        x3 <- rnorm(n)
        x4 <- rnorm(n)
        x5 <- rnorm(n)
        x6 <- rnorm(n)
        x7 <- rnorm(n)
        x8 <- rnorm(n)
      else {
          x1 <- rnorm(n)
          x2 <- x1 + 2.0 * rnorm(n)       # was + 0.5 * rnorm(n)
          x3 <- rnorm(n)
          x4 <- x3 + 1.5 * rnorm(n)
          x5 <- x1 + rnorm(n)/1.3
          x6 <- x2 + 2.25 * rnorm(n)      # was rnorm(n)/1.3
          x7 <- x3 + x4 + 2.5 * rnorm(n)  # was + rnorm(n)
          x8 <- x7 + 4.0 * rnorm(n)       # was + 0.5 * rnorm(n)
      z <- cbind(x1,x2,x3,x4,x5,x6,x7,x8)
      if(prcor) return(round(cor(z), 2))
      lp <- x1 + x2 + .5*x3 + .4*x7
      y <- lp + sigma*rnorm(n)
      if(dataonly) return(list(x=z, y=y))
      if(method == 'leaps') {
        s     <- summary(regsubsets(z, y))
        best  <- which.max(s$adjr2)
        xvars <- s$which[best, -1]   # remove intercept
        ssr   <- s$rss[best]
        p     <- sum(xvars)
        xs    <- if(p == 0) 'none' else paste((1 : 8)[xvars], collapse='')
        if(pr) print(xs)
        ssesw <- (n - 1) * var(y) - ssr
        s2s   <- ssesw / (n - p - 1)
        yhat  <- if(p == 0) mean(y) else fitted(lm(y ~ z[, xvars]))
      f <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
      if(method == 'stepaic') {
        g <- stepAIC(f, trace=0)
        p <- g$rank - 1
        xs <- if(p == 0) 'none' else
          gsub('[ <br>+x]','', as.character(formula(g))[3])
        if(pr) print(formula(g), showEnv=FALSE)
        ssesw <- sum(resid(g)^2)
        s2s   <- ssesw/g$df.residual
        yhat  <- fitted(g)
      # Set SSEsw / (n - gdf - 1) = true sigma^2
      gdf <- n - 1 - ssesw / (sigma^2)
      # Compute root mean squared error against true linear predictor
      rmse.full <- sqrt(mean((fitted(f) - lp) ^ 2))
      rmse.step <- sqrt(mean((yhat - lp) ^ 2))
      list(stats=c(n=n, vratio=s2s/(sigma^2),
             gdf=gdf, apparentdf=p, rmse.full=rmse.full, rmse.step=rmse.step),
    rsim <- function(B, n, method=c('stepaic', 'leaps')) {
      method <- match.arg(method)
      xs <- character(B)
      r <- matrix(NA, nrow=B, ncol=6)
      for(i in 1:B) {
        w     <- sim(n, method=method)
        r[i,] <- w$stats
        xs[i] <- w$xselected
      colnames(r) <- names(w$stats)
      s <- apply(r, 2, median)
      p <- r[, 'apparentdf']
      s['apparentdf'] <- mean(p)
      print(round(s, 2))
      cat('Prob[correct model]=', round(sum(xs == '1237')/B, 2), '\n')

    Show the correlation matrix being assumed for the \(X\)s:

    uncorrelated <- TRUE
    sim(50000, prcor=TRUE)
          x1   x2    x3   x4   x5   x6   x7    x8
    x1  1.00 0.00 -0.01 0.00 0.01 0.01 0.00  0.01
    x2  0.00 1.00  0.00 0.00 0.01 0.00 0.00  0.00
    x3 -0.01 0.00  1.00 0.00 0.00 0.00 0.00 -0.01
    x4  0.00 0.00  0.00 1.00 0.01 0.00 0.00  0.01
    x5  0.01 0.01  0.00 0.01 1.00 0.01 0.00  0.00
    x6  0.01 0.00  0.00 0.00 0.01 1.00 0.00  0.00
    x7  0.00 0.00  0.00 0.00 0.00 0.00 1.00  0.01
    x8  0.01 0.00 -0.01 0.01 0.00 0.00 0.01  1.00

    Simulate to find the distribution of the number of variables selected, the proportion of simulations in which the true model (\(X_{1}, X_{2}, X_{3}, X_{7}\)) was found, the median value of \(\hat{\sigma}^{2}/\sigma^{2}\), the median effective d.f., and the mean number of apparent d.f., for varying sample sizes.

    m <- 'leaps'             # all possible regressions stopping on R2adj
    rsim(100, 20, method=m)  # actual model found twice out of 100
             n     vratio        gdf apparentdf  rmse.full  rmse.step 
         20.00       0.94       5.32       4.10       1.62       1.58 
     1  2  3  4  5  6  7  8 
     3 14 18 22 27 11  4  1 
    Prob[correct model]= 0.02 
    rsim(100, 40, method=m)
             n     vratio        gdf apparentdf  rmse.full  rmse.step 
         40.00       0.61      17.89       4.38       1.21       1.24 
     2  3  4  5  6  7 
     9 18 24 29 15  5 
    Prob[correct model]= 0.04 
    rsim(100, 150, method=m)
             n     vratio        gdf apparentdf  rmse.full  rmse.step 
        150.00       0.44      85.99       5.01       0.59       0.57 
     2  3  4  5  6  7  8 
     1  5 27 35 24  7  1 
    Prob[correct model]= 0.2 
    rsim(100, 300, method=m)
             n     vratio        gdf apparentdf  rmse.full  rmse.step 
        300.00       0.42     177.01       5.16       0.43       0.40 
     4  5  6  7  8 
    27 42 20 10  1 
    Prob[correct model]= 0.26 
    rsim(100, 2000)
             n     vratio        gdf apparentdf  rmse.full  rmse.step 
       2000.00       1.00       6.43       4.58       0.17       0.15 
     4  5  6  7 
    53 37  9  1 
    Prob[correct model]= 0.53 

    As \(n\uparrow\) the mean number of variables selected increased. The proportion of simulations in which the correct model was found increased from 0 to 0.53. \(\sigma^{2}\) is underestimated when \(n=300\) by a factor of 0.42, resulting in the d.f. needed to de-bias \(\hat{\sigma^{2}}\) being greater than \(n\) when the apparent d.f. was only 5.16 on the average. Variable selection slightly increased closeness to the true \(X\beta\).

    If the simulations are re-run allowing for collinearities (uncorrelated=FALSE) one can expect variable selection to be even more problematic.

    Variable selection methods (F. E. Harrell, 1986):

    • Forward selection, backward elimination
    • Stopping rule: “residual \(\chi^{2}\)” with d.f. = no. candidates remaining at current step
    • Test for significance or use Akaike’s information criterion (AIC (Atkinson, 1980)), here \(\chi^{2}-2 \times d.f.\)
    • Better to use subject matter knowledge!
    • No currently available stopping rule was developed for stepwise, only for comparing a limited number of pre-specified models[bre92lit, Section 1.3]
    • Roecker (1991) studied forward selection (FS), all possible subsets selection (APS), full fits
    • APS more likely to select smaller, less accurate models than FS
    • Neither as accurate as full model fit unless more than half of candidate variables redundant or unnecessary
    • Step-down is usually better than forward (Mantel, 1970) and can be used efficiently with maximum likelihood estimation (Lawless & Singhal, 1978)
    • Fruitless to try different stepwise methods to look for agreement (Wiegand, 2010)
    • Bootstrap can help decide between full and reduced model
    • Full model fits gives meaningful confidence intervals with standard formulas, C.I. after stepwise does not (Altman & Andersen, 1989; Leo Breiman, 1992; Hurvich & Tsai, 1990)
    • Data reduction (grouping variables) can help
    • Using the bootstrap to select important variables for inclusion in the final model (Sauerbrei & Schumacher, 1992) is problematic (Austin, 2008)
    • It is not logical that a population regression coefficient would be exactly zero just because its estimate was “insignificant”

    See also these articles:


    4.3.1 Maxwell’s Demon as an Analogy to Variable Selection

    Some of the information in the data is spent on variable selection instead of using all information for estimation.

    Model specification is preferred to model selection.

    Information content of the data usually insufficient for reliable variable selection.

    James Clerk Maxwell

    Maxwell imagines one container divided into two parts, A and B. Both parts are filled with the same gas at equal temperatures and placed next to each other. Observing the molecules on both sides, an imaginary demon guards a trapdoor between the two parts. When a faster-than-average molecule from A flies towards the trapdoor, the demon opens it, and the molecule will fly from A to B. Likewise, when a slower-than-average molecule from B flies towards the trapdoor, the demon will let it pass from B to A. The average speed of the molecules in B will have increased while in A they will have slowed down on average. Since average molecular speed corresponds to temperature, the temperature decreases in A and increases in B, contrary to the second law of thermodynamics.

    Szilárd pointed out that a real-life Maxwell’s demon would need to have some means of measuring molecular speed, and that the act of acquiring information would require an expenditure of energy. Since the demon and the gas are interacting, we must consider the total entropy of the gas and the demon combined. The expenditure of energy by the demon will cause an increase in the entropy of the demon, which will be larger than the lowering of the entropy of the gas.

    Source: commons.wikimedia.org/wiki/File:YoungJamesClerkMaxwell.jpg

    Peter Ellis’ blog article contains excellent examples of issues discussed here but applied to time series modeling.

    4.4 Overfitting and Limits on Number of Predictors

    • Concerned with avoiding overfitting
    • Assume typical problem in medicine, epidemiology, and the social sciences in which the signal:noise ratio is small (higher ratios allow for more aggressive modeling)
    • \(p\) should be \(< \frac{m}{15}\) (F. E. Harrell et al., 1984, 1985; Peduzzi et al., 1995, 1996; Smith et al., 1992; van der Ploeg et al., 2014; Vittinghoff & McCulloch, 2006)
    • $p= $ number of parameters in full model or number of candidate parameters in a stepwise analysis
    • Derived from simulations to find minimum sample size so that apparent discrimination = validated discrimination
    • Applies to typical signal:noise ratios found outside of tightly controlled experiments
    • If true \(R^{2}\) is high, many parameters can be estimated from smaller samples
    • Ignores sample size needed just to estimate the intercept or, in semiparametric models, the underlying distribution function3
    • Riley, Snell, Ensor, Burke, Harrell, et al. (2019) and Riley, Snell, Ensor, Burke, Jr, et al. (2019) have refined sample size estimation for continuous, binary, and time-to-event models to account for all of these issues
    • To just estimate \(\sigma\) in a linear model with a multiplicative margin of error of 1.2 with 0.95 confidence requires \(n=70\)
  • 3 The sample size needed for these is model-dependent

  • S
    Limiting Sample Sizes for Various Response Variable Types
    Type of Response Variable Limiting Sample Size \(m\)
    Continuous \(n\) (total sample size)
    Binary \(\min(n_{1}, n_{2})\) 4
    Ordinal (\(k\) categories) \(n - \frac{1}{n^{2}} \sum_{i=1}^{k} n_{i}^{3}\) 5
    Failure (survival) time number of failures 6
  • 4 If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is \(3 n_{1}n_{2}/n \approx 3 \min(n_{1}, n_{2})\) if \(\frac{n_{1}}{n}\) is near 0 or 1 (Whitehead, 1993, Eq. 10, 15). Here \(n_{1}\) and \(n_{2}\) are the marginal frequencies of the two response levels (Peduzzi et al., 1996).

  • 5 Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are \(n_{1}, \ldots, n_{k}\), compared with all cell sizes equal to unity (response is continuous) (Whitehead, 1993, Eq. 3). If all cell sizes are equal, the relative efficiency of having \(k\) response categories compared to a continuous response is \(1 - \frac{1}{k^{2}}\) (Whitehead, 1993, Eq. 14), e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.

  • 6 This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional hazards methods such as Wilcoxon-type tests (Benedetti et al., 1982).

    • Narrowly distributed predictor \(\rightarrow\) even higher \(n\)
    • \(p\) includes all variables screened for association with response, including interactions
    • Univariable screening (graphs, crosstabs, etc.) {} reduces multiple comparison problems of model building (Sun et al., 1996)

    4.5 Shrinkage

    • Slope of calibration plot; regression to the mean
    • Statistical estimation procedure — “pre-shrunk” models
    • Aren’t regression coefficients OK because they’re unbiased?
    • Problem is in how we use coefficient estimates
    • Consider 20 samples of size \(n=50\) from \(U(0,1)\)
    • Compute group means and plot in ascending order
    • Equivalent to fitting an intercept and 19 dummies using least squares
    • Result generalizes to general problems in plotting \(Y\) vs.
    n <- 50
    y <- runif(20*n)
    group <- rep(1:20,each=n)
    ybar <- tapply(y, group, mean)
    ybar <- sort(ybar)
    plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3,.7),
         xlab='Group', ylab='Group Mean')
    lines(1:20, ybar)
    points(1:20, ybar, pch=20, cex=.5)
    axis(1, at=1:20, labels=FALSE)
    for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])

    Figure 4.2: Sorted means from 20 samples of size 50 from a uniform \([0,1]\) distribution. The reference line at 0.5 depicts the true population value of all of the means.

  • 7 An excellent discussion about such indexes may be found here.

  • 4.6 Collinearity

    • When at least 1 predictor can be predicted well from others
    • Can be a blessing (data reduction, transformations)
    • \(\uparrow\) s.e. of \(\hat{\beta}\), \(\downarrow\) power
    • This is appropriate \(\rightarrow\) asking too much of the data (Chatterjee & Hadi, 2012, Chapter 9)
    • Variables compete in variable selection, chosen one arbitrary
    • Does not affect joint influence of a set of highly correlated variables (use multiple d.f. tests)
    • Does not at all affect predictions on model construction sample
    • Does not affect predictions on new data (Myers, 1990, pp. 379–381) if
      • Extreme extrapolation not attempted
      • New data have same type of collinearities as original data
    • Example: LDL and total cholesterol – problem only if more inconsistent in new data
    • Example: age and age\(^{2}\) – no problem
    • One way to quantify for each predictor: variance inflation factors (VIF)
    • General approach (maximum likelihood) — transform information matrix to correlation form, VIF=diagonal of inverse (Davis et al., 1986; Wax, 1992)
    • See David A. Belsley (1991), pp. 28-30 for problems with VIF
    • Easy approach: SAS VARCLUS procedure (Sarle, 1990), R varclus function, other clustering techniques: group highly correlated variables
    • Can score each group (e.g., first principal component, \(PC_1\) (D’Agostino et al., 1995)); summary scores not collinear

    4.7 Data Reduction

    • Unless \(n>>p\), model unlikely to validate
    • Data reduction: \(\downarrow p\)
    • Use the literature to eliminate unimportant variables.
    • Eliminate variables whose distributions are too narrow.
    • Eliminate candidate predictors that are missing in a large number of subjects, especially if those same predictors are likely to be missing for future applications of the model.
    • Use a statistical data reduction method such as incomplete principal components regression, nonlinear generalizations of principal components such as principal surfaces, sliced inverse regression, variable clustering, or ordinary cluster analysis on a measure of similarity between variables.
    • Data reduction is completely masked to \(Y\), which is precisely why it does not distort estimates, standard errors, \(P\)-values, or confidence limits
    • Data reduction = unsupervised learning
    • Example: dataset with 40 events and 60 candidate predictors
      • Use variable clustering to group variables by correlation structure
      • Use clinical knowledge to refine the clusters
      • Keep age and severity of disease as separate predictors because of their strength
      • For others create clusters: socioeconomic, risk factors/history, and physiologic function
      • Summarize each cluster with its first principal component \(PC_{1}\), i.e., the linear combination of characteristics that maximizes variance of the score across subjects subject to an overall constraint on the coefficients
      • Fit outcome model with 5 predictors

    4.7.1 Redundancy Analysis

    • Remove variables that have poor distributions
      • E.g., categorical variables with fewer than 2 categories having at least 20 observations
    • Use flexible additive parametric additive models to determine how well each variable can be predicted from the remaining variables
    • Variables dropped in stepwise fashion, removing the most predictable variable at each step
    • Remaining variables used to predict
    • Process continues until no variable still in the list of predictors can be predicted with an \(R^2\) or adjusted \(R^2\) greater than a specified threshold or until dropping the variable with the highest \(R^2\) (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at the threshold from the now smaller list of predictors
    • R function redun in Hmisc package
    • Related to principal variables (McCabe, 1984) but faster

    4.7.2 Variable Clustering

    • Goal: Separate variables into groups
      • variables within group correlated with each other
      • variables not correlated with non-group members
    • Score each dimension, stop trying to separate effects of factors measuring same phenomenon
    • Variable clustering (D’Agostino et al., 1995; Sarle, 1990) (oblique-rotation PC analysis) \(\rightarrow\) separate variables so that first PC is representative of group
    • Can also do hierarchical cluster analysis on similarity matrix based on squared Spearman or Pearson correlations, or more generally, Hoeffding’s \(D\) (Hoeffding, 1948).
    • See Guo et al. (2011) for a method related to variable clustering and sparse principal components.
    • Chavent et al. (2012) implement many more variable clustering methods

    Example: Figure 15.6

    4.7.3 Transformation and Scaling Variables Without Using \(Y\)

    • Reduce \(p\) by estimating transformations using associations with other predictors

    • Purely categorical predictors – correspondence analysis (Ciampi et al., 1986; Crichton & Hinde, 1989; Greenacre, 1988; Leclerc et al., 1988; Michailidis & de Leeuw, 1998)

    • Mixture of qualitative and continuous variables: qualitative principal components

    • Maximum total variance (MTV) of Young, Takane, de Leeuw (Michailidis & de Leeuw, 1998; Young et al., 1978)

      1. Compute \(PC_1\) of variables using correlation matrix
      2. Use regression (with splines, dummies, etc.) to predict \(PC_1\) from each \(X\) — expand each \(X_j\) and regress it separately on \(PC_1\) to get working transformations
      3. Recompute \(PC_1\) on transformed \(X\)s
      4. Repeat 3-4 times until variation explained by \(PC_1\) plateaus and transformations stabilize
    • Maximum generalized variance (MGV) method of Sarle (Kuhfeld, 2009, pp. 1267–1268)

      1. Predict each variable from (current transformations of) all other variables
      2. For each variable, expand it into linear and nonlinear terms or dummies, compute first canonical variate
      3. For example, if there are only two variables \(X_{1}\) and \(X_{2}\) represented as quadratic polynomials, solve for \(a,b,c,d\) such that \(aX_{1}+bX_{1}^{2}\) has maximum correlation with \(cX_{2}+dX_{2}^{2}\).
      4. Goal is to transform each var. so that it is most similar to predictions from other transformed variables
      5. Does not rely on PCs or variable clustering
    • MTV (PC-based instead of canonical var.) and MGV implemented in SAS PROC PRINQUAL (Kuhfeld, 2009)

      1. Allows flexible transformations including monotonic splines
      2. Does not allow restricted cubic splines, so may be unstable unless monotonicity assumed
      3. Allows simultaneous imputation but often yields wild estimates

    4.7.4 Simultaneous Transformation and Imputation

    R transcan Function for Data Reduction & Imputation

    • Initialize missings to medians (or most frequent category)

    • Initialize transformations to original variables

    • Take each variable in turn as \(Y\)

    • Exclude obs. missing on \(Y\)

    • Expand \(Y\) (spline or dummy variables)

    • Score (transform \(Y\)) using first canonical variate

    • Missing \(Y\) \(\rightarrow\) predict canonical variate from \(X\)s

    • The imputed values can optionally be shrunk to avoid overfitting for small \(n\) or large \(p\)

    • Constrain imputed values to be in range of non-imputed ones

    • Imputations on original scale

      • Continuous \(\rightarrow\) back-solve with linear interpolation
      • Categorical \(\rightarrow\) classification tree (most freq. cat.) or match to category whose canonical score is closest to one predicted
    • Multiple imputation — bootstrap or approx. Bayesian bootstrap.

      • Sample residuals multiple times (default \(M=5\))
      • Are on “optimally” transformed scale
      • Back-transform
      • fit.mult.impute works with aregImpute and transcan output to easily get imputation-corrected variances and avg. \(\hat{\beta}\)
    • Option to insert constants as imputed values (ignored during transformation estimation); helpful when a lab value may be missing because the patient returned to normal

    • Imputations and transformed values may be easily obtained for new data

    • An R function Function will create a series of R functions that transform each predictor

    • Example: \(n=415\) acutely ill patients

      • Relate heart rate to mean arterial blood pressure
      • Two blood pressures missing
      • Heart rate not monotonically related to blood pressure
      • See Figure 4.3 and Figure 4.4
    getHdata(support)   # Get data frame from web site
    heart.rate     <- support$hrt
    blood.pressure <- support$meanbp
    Mean Arterial Blood Pressure Day 3 
    [1] 151 136
    blood.pressure[400:401] <- NA  # Create two missings
    d <- data.frame(heart.rate, blood.pressure)
    spar(mfrow=c(1,2), ps=9, pch=46)
    w <- transcan(~ heart.rate + blood.pressure, transformed=TRUE,
                  imputed=TRUE, show.na=TRUE, data=d)
    Convergence criterion:2.901 0.035 
    Convergence in 4 iterations
    R-squared achieved in predicting each variable:
        heart.rate blood.pressure 
             0.259          0.259 
    Adjusted R-squared:
        heart.rate blood.pressure 
             0.254          0.253 
         400      401 
    132.4057 109.7741 
    t <- w$transformed
    spe <- round(c(spearman(heart.rate, blood.pressure),
                            t[,'blood.pressure'])), 2)

    Figure 4.3: Transformations fitted using transcan. Tick marks indicate the two imputed values for blood pressure.

    spar(ps=9, mfrow=c(1,2))
    plot(heart.rate, blood.pressure)
    plot(t[,'heart.rate'], t[,'blood.pressure'],
         xlab='Transformed hr', ylab='Transformed bp')

    Figure 4.4: Relationship between heart rate and blood pressure: untransformed data (left) and transformed data (right). Data courtesy of the SUPPORT study (Knaus et al., 1995).

    ACE (Alternating Conditional Expectation) of L. Breiman & Friedman (1985)

    1. Uses nonparametric “super smoother” (Friedman, 1984)
    2. Allows monotonicity constraints, categorical vars.
    3. Does not handle missing data
    • These methods find marginal transformations

    • Check adequacy of transformations using \(Y\)

      • Graphical
      • Nonparametric smoothers (\(X\) vs. \(Y\))
      • Expand original variable using spline, test additional predictive information over original transformation

    4.7.5 Simple Scoring of Variable Clusters

    • Try to score groups of transformed variables with \(PC_1\)
    • Reduces d.f. by pre-transforming var. and by combining multiple var.
    • Later may want to break group apart, but delete all variables in groups whose summary scores do not add significant information
    • Sometimes simplify cluster score by finding a subset of its constituent variables which predict it with high \(R^2\).

    Series of dichotomous variables:

    • Construct \(X_1\) = 0-1 according to whether any variables positive
    • Construct \(X_2\) = number of positives
    • Test whether original variables add to \(X_1\) or \(X_2\)

    4.7.6 Simplifying Cluster Scores

    4.7.7 How Much Data Reduction Is Necessary?

    Using Expected Shrinkage to Guide Data Reduction

    • Fit full model with all candidates, \(p\) d.f., LR likelihood ratio \(\chi^2\)
    • Compute \(\hat{\gamma}\)
    • If \(<0.9\), consider shrunken estimator from whole model, or data reduction (again not using \(Y\))
    • \(q\) regression d.f. for reduced model
    • Assume best case: discarded dimensions had no association with \(Y\)
    • Expected loss in LR is \(p-q\)
    • New shrinkage \([{\rm LR} - (p-q) - q]/[{\rm LR} - (p-q)]\)
    • Solve for \(q\) \(\rightarrow\) \(q \leq ({\rm LR}-p)/9\)
    • Under these assumptions, no hope unless original LR \(> p+9\)
    • No \(\chi^2\) lost by dimension reduction \(\rightarrow\) \(q \leq {\rm LR}/10\)


    • Binary logistic model, 45 events on 150 subjects
    • 10:1 rule \(\rightarrow\) analyze 4.5 d.f. total
    • Analyst wishes to include age, sex, 10 others
    • Not known if age linear or if age and sex additive
    • 4 knots \(\rightarrow\) \(3+1+1\) d.f. for age and sex if restrict interaction to be linear
    • Full model with 15 d.f. has LR=50
    • Expected shrinkage factor \((50-15)/50 = 0.7\)
    • LR\(>15+9=24\) \(\rightarrow\) reduction may help
    • Reduction to \(q = (50-15)/9 \approx 4\) d.f. necessary
    • Have to assume age linear, reduce other 10 to 1 d.f.
    • Separate hypothesis tests intended \(\rightarrow\) use full model, adjust for multiple comparisons
    Summary of Some Data Reduction Methods
    Goals Reasons Methods
    Group predictors so that each group represents a single dimension that can be summarized with a single score \(\downarrow\) d.f. arising from multiple predictors
    • Make \(PC_1\) more reasonable summary
    Variable clustering

    • Subject matter knowledge
    • Group predictors to maximize proportion of variance explained by \(PC_1\) of each group
    • Hierarchical clustering using a matrix of similarity measures between predictors
    Transform predictors \(\downarrow\) d.f. due to nonlinear and dummy variable components
    • Allows predictors to be optimally combined
    • Make \(PC_1\) more reasonable summary
    • Use in customized model for imputing missing values on each predictor
    • Maximum total variance on a group of related predictors
    • Canonical variates on the total set of predictors

    4.8 Other Approaches to Predictive Modeling

    4.9 Overly Influential Observations

    • Every observation should influence fit
    • Major results should not rest on 1 or 2 obs.
    • Overly infl. obs. \(\rightarrow\) \(\uparrow\) variance of predictions
    • Also affects variable selection

    Reasons for influence:

    • Too few observations for complexity of model (see Section 4.7, Section 4.3)
    • Data transcription or entry errors
    • Extreme values of a predictor
    1. Sometimes subject so atypical should remove from dataset
    2. Sometimes truncate measurements where data density ends
    3. Example: \(n=4000\), 2000 deaths, white blood count range 500-100,000, .05,.95 quantiles=2755, 26700
    4. Linear spline function fit
    5. Sensitive to WBC\(>60000\ (n=16)\)
    6. Predictions stable if truncate WBC to 40000 (\(n=46\) above 40000)
    • Disagreements between predictors and response. Ignore unless extreme values or another explanation
    • Example: \(n=8000\), one extreme predictor value not on straight line relationship with other \((X,Y)\) \(\rightarrow\) \(\chi^{2}=36\) for \(H_{0}:\) linearity

    Statistical Measures:

    • Leverage: capacity to be influential (not necessarily infl.)
      Diagonals of “hat matrix” \(H = X(X'X)^{-1}X'\) — measures how an obs. predicts its own response (D. A. Belsley et al., 1980)
    • \(h_{ii}>2(p+1)/n\) may signal a high leverage point (D. A. Belsley et al., 1980)
    • DFBETAS: change in \(\hat{\beta}\) upon deletion of each obs, scaled by s.e.
    • DFFIT: change in \(X\hat{\beta}\) upon deletion of each obs
    • DFFITS: DFFIT standardized by s.e. of \(\hat{\beta}\)
    • Some classify obs as overly influential when \(|{\rm DFFITS}|>2\sqrt{(p+1)/(n-p-1)}\) (D. A. Belsley et al., 1980)
    • Others examine entire distribution for “outliers”
    • No substitute for careful examination of data (Chatfield, 1991; Spence & Garrison, 1993)
    • Maximum likelihood estimation requires 1-step approximations

    4.10 Comparing Two Models

    • Level playing field (independent datasets, same no. candidate d.f., careful bootstrapping)
    • Criteria:
      1. calibration
      2. discrimination
      3. face validity
      4. measurement errors in required predictors
      5. use of continuous predictors (which are usually better defined than categorical ones)
      6. omission of “insignificant” variables that nonetheless make sense as risk factors
      7. simplicity (though this is less important with the availability of computers)
      8. lack of fit for specific types of subjects
    • Goal is to rank-order: ignore calibration
    • Otherwise, dismiss a model having poor calibration
    • Good calibration \(\rightarrow\) compare discrimination (e.g., \(R^{2}\) (Nagelkerke, 1991), model \(\chi^2\), Somers’ \(D_{xy}\), Spearman’s \(\rho\), area under ROC curve)
    • Worthwhile to compare models on a measure not used to optimize either model, e.g., mean absolute error, median absolute error if using OLS
    • Rank measures may not give enough credit to extreme predictions \(\rightarrow\) model \(\chi^{2}, R^{2}\), examine extremes of distribution of \(\hat{Y}\)
    • Examine differences in predicted values from the two models
    • See (Peek et al., 2007; Pencina et al., 2008, 2011, 2012) for discussions and examples of low power for testing differences in ROC areas, and for other approaches.

    4.11 Improving the Practice of Multivariable Prediction

    See also Section 5.6

    Greenland (2000) discusses many important points:

    • Stepwise variable selection on confounders leaves important confounders uncontrolled
    • Shrinkage is far superior to variable selection
    • Variable selection does more damage to confidence interval widths than to point estimates
    • Claims about unbiasedness of ordinary MLEs are misleading because they assume the model is correct and is the only model entertained
    • “models need to be complex to capture uncertainty about the relations … an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an anti-parsimony principle often attributed to L. J. Savage ‘All models should be as big as an elephant’ (see Draper, 1995)”

    Greenland’s example of inadequate adjustment for confounders as a result of using a bad modeling strategy:

    • Case-control study of diet, food constituents, breast cancer
    • 140 cases, 222 controls
    • 35 food constituent intakes and 5 confounders
    • Food intakes are correlated
    • Traditional stepwise analysis not adjusting simultaneously for all foods consumed \(\rightarrow\) 11 foods had \(P < 0.05\)
    • Full model with all 35 foods competing \(\rightarrow\) 2 had \(P < 0.05\)
    • Rigorous simultaneous analysis (hierarchical random slopes model) penalizing estimates for the number of associations examined \(\rightarrow\) no foods associated with breast cancer

    Global Strategies

    • Use a method known not to work well (e.g., stepwise variable selection without penalization; recursive partitioning), document how poorly the model performs (e.g. using the bootstrap), and use the model anyway
    • Develop a black box model that performs poorly and is difficult to interpret (e.g., does not incorporate penalization)
    • Develop a black box model that performs well and is difficult to interpret
    • Develop interpretable approximations to the black box
    • Develop an interpretable model (e.g. give priority to additive effects) that performs well and is likely to perform equally well on future data from the same stream

    Preferred Strategy in a Nutshell

    • Decide how many d.f. can be spent
    • Decide where to spend them
    • Spend them
    • Don’t reconsider, especially if inference needed

    4.12 Summary: Possible Modeling Strategies

    4.12.1 Developing Predictive Models

    1. Assemble accurate, pertinent data and lots of it, with wide distributions for \(X\).
    2. Formulate good hypotheses — specify relevant candidate predictors and possible interactions. Don’t use \(Y\) to decide which \(X\)’s to include.
    3. Characterize subjects with missing \(Y\). Delete such subjects in rare circumstances (Crawford et al., 1995). For certain models it is effective to multiply impute \(Y\).
    4. Characterize and impute missing \(X\). In most cases use multiple imputation based on \(X\) and \(Y\)
    5. For each predictor specify complexity or degree of nonlinearity that should be allowed (more for important predictors or for large \(n\)) (Section 4.1)
    6. Do data reduction if needed (pre-transformations, combinations), or use penalized estimation (Frank E. Harrell et al., 1998)
    7. Use the entire sample in model development
    8. Can do highly structured testing to simplify “initial” model
      • Test entire group of predictors with a single \(P\)-value
      • Make each continuous predictor have same number of knots, and select the number that optimizes AIC
      • Test the combined effects of all nonlinear terms with a single \(P\)-value
    9. Make tests of linearity of effects in the model only to demonstrate to others that such effects are often statistically significant. Don’t remove individual insignificant effects from the model.
    10. Check additivity assumptions by testing pre-specified interaction terms. Use a global test and either keep all or delete all interactions.
    11. Check to see if there are overly-influential observations.
    12. Check distributional assumptions and choose a different model if needed.
    13. Do limited backwards step-down variable selection if parsimony is more important that accuracy (Spiegelhalter, 1986). But confidence limits, etc., must account for variable selection (e.g., bootstrap).
    14. This is the “final” model.
    15. Interpret the model graphically and by computing predicted values and appropriate test statistics. Compute pooled tests of association for collinear predictors.
    16. Validate this model for calibration and discrimination ability, preferably using bootstrapping.
    17. Shrink parameter estimates if there is overfitting but no further data reduction is desired (unless shrinkage built-in to estimation)
    18. When missing values were imputed, adjust final variance-covariance matrix for imputation. Do this as early as possible because it will affect other findings.
    19. When all steps of the modeling strategy can be automated, consider using Faraway (1992) to penalize for the randomness inherent in the multiple steps.
    20. Develop simplifications to the final model as needed.

    4.12.2 Developing Models for Effect Estimation

    1. Less need for parsimony; even less need to remove insignificant variables from model (otherwise CLs too narrow)
    2. Careful consideration of interactions; inclusion forces estimates to be conditional and raises variances
    3. If variable of interest is mostly the one that is missing, multiple imputation less valuable
    4. Complexity of main variable specified by prior beliefs, compromise between variance and bias
    5. Don’t penalize terms for variable of interest
    6. Model validation less necessary

    4.12.3 Developing Models for Hypothesis Testing

    1. Virtually same as previous strategy
    2. Interactions require tests of effect by varying values of another variable, or “main effect + interaction” joint tests (e.g., is treatment effective for either sex, allowing effects to be different)
    3. Validation may help quantify over-adjustment