Case Study in Binary Logistic Regression, Model Selection and Approximation: Predicting Cause of Death
11.1 Overview
This chapter contains a case study on developing, describing, and validating a binary logistic regression model. In addition, the following methods are exemplified:
1. Data reduction using incomplete linear and nonlinear principal components
2. Use of AIC to choose from five modeling variations, deciding which is best for the number of parameters
3. Model simplification using stepwise variable selection and approximation of the full model
4. The relationship between the degree of approximation and the degree of predictive discrimination loss
5. Bootstrap validation that includes penalization for model uncertainty (variable selection) and that demonstrates a loss of predictive discrimination over the full model even when compensating for overfitting the full model
The data reduction and pre-transformation methods used here were discussed in more detail in Chapter 8. Single imputation will be used because of the limited quantity of missing data.
11.2 Background
Consider the randomized trial of estrogen for treatment of prostate cancer (Byar & Green, 1980) described in Chapter 8. In this trial, larger doses of estrogen reduced the effect of prostate cancer but at the cost of increased risk of cardiovascular death. Kay (1986) did a formal analysis of the competing risks for cancer, cardiovascular, and other deaths. It can also be quite informative to study how treatment and baseline variables relate to the cause of death for those patients who died (Larson & Dinse, 1985). We subset the original dataset to those patients dying from prostate cancer (\(n=130\)), heart or vascular disease (\(n=96\)), or cerebrovascular disease (\(n=31\)). Our goal is to predict cardiovascular–cerebrovascular death (cvd, \(n=127\)) given that the patient died from either cvd or prostate cancer. Of interest is whether the time to death has an effect on the cause of death, and whether the importance of certain variables depends on the time of death.
11.3 Data Transformations and Single Imputation
In R, first obtain the desired subset of the data and do some preliminary calculations such as combining an infrequent category with the next category, and dichotomizing ekg for use in ordinary principal components (PCs).
```r
require(rms)
options(prType='html')
getHdata(prostate)
prostate <- within(prostate, {
  levels(ekg)[levels(ekg) %in% c('old MI', 'recent MI')] <- 'MI'
  ekg.norm <- 1 * (ekg %in% c('normal', 'benign'))
  levels(ekg) <- abbreviate(levels(ekg))
  pfn <- as.numeric(pf)
  levels(pf) <- levels(pf)[c(1, 2, 3, 3)]
  cvd <- status %in% c("dead - heart or vascular",
                       "dead - cerebrovascular")
  rxn <- as.numeric(rx)
})
# Use transcan to compute optimal pre-transformations
ptrans <-   # See Figure @fig-prostate-transcan
  transcan(~ sz + sg + ap + sbp + dbp + age + wt + hg +
             ekg + pf + bm + hx + dtime + rx,
           imputed=TRUE, transformed=TRUE,
           data=prostate, pl=FALSE, pr=FALSE)
# Use transcan single imputations
imp <- impute(ptrans, data=prostate, list.out=TRUE)
NAvars <- all.vars(~ sz + sg + age + wt + ekg)
for(x in NAvars) prostate[[x]] <- imp[[x]]
subset <- prostate$status %in% c("dead - heart or vascular",
                                 "dead - cerebrovascular",
                                 "dead - prostatic ca")
trans <- ptrans$transformed[subset, ]
psub  <- prostate[subset, ]
```

```
Imputed missing values with the following frequencies
and stored them in variables with their original names:

 sz  sg age  wt ekg
  5  11   1   2   8
```
11.4 Regression on Original Variables, Principal Components and Pretransformations
We first examine the performance of data reduction in predicting the cause of death, similar to what we did for survival time in Section 8.6. The first analyses assess how well PCs (on raw and transformed variables) predict the cause of death.
There are 127 cvds. We use the 15:1 rule of thumb discussed in Section 4.4 to justify using the first 8 PCs. ap is log-transformed because of its extreme distribution. We use Hmisc::princmp.
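The arithmetic behind this rule of thumb is worth making explicit. A minimal sketch (the function name `max_promising_params` is mine, not from the original analysis): with 127 events and a 15:1 events-per-parameter target, roughly 8 parameters can be estimated without severe overfitting, hence the first 8 PCs.

```python
# Sketch of the events-per-candidate-parameter rule of thumb.
# With m events and a target ratio of 15:1, fit at most m // 15 parameters.
def max_promising_params(n_events: int, ratio: int = 15) -> int:
    """Rough cap on the number of candidate parameters."""
    return n_events // ratio

# 127 cvd deaths -> about 8 parameters, hence 8 principal components
print(max_promising_params(127))  # -> 8
```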
```r
# Compute the first 8 PCs on raw variables then on
# transformed ones
p <- princmp(~ sz + sg + log(ap) + sbp + dbp + age + wt +
               hg + ekg.norm + pfn + bm + hx + rxn + dtime,
             data=psub, k=8, sw=TRUE, kapprox=2)
p
```

```
Principal Components Analysis

Stepwise Approximations to PCs With Cumulative R^2

PC 1:  log(ap) (0.609) + bm (0.737) + dbp (0.803) + sg (0.86) + hg (0.901)
PC 2:  dbp (0.371) + sg (0.544) + age (0.652) + sbp (0.743) + hx (0.803)
```

```r
pc8 <- p$scores
f8  <- lrm(cvd ~ pc8, data=psub)
p    <- princmp(trans, k=8, sw=TRUE, kapprox=2)
pc8t <- p$scores
f8t  <- lrm(cvd ~ pc8t, data=psub)
# Fit binary logistic model on original variables
# x=TRUE y=TRUE are for test='LR', validate, calibrate
f <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +
           hg + ekg + pf + bm + hx + rx + dtime,
         x=TRUE, y=TRUE, data=psub)
# Expand continuous variables using restricted cubic splines
g <- lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) +
           rcs(sbp,4) + rcs(dbp,4) + rcs(age,4) + rcs(wt,4) +
           rcs(hg,4) + ekg + pf + bm + hx + rx + rcs(dtime,4),
         data=psub)
# Fit binary logistic model on individually transformed variables
h <- lrm(cvd ~ trans, data=psub)
```

The five approaches to modeling the outcome are compared using AIC (where smaller is better).

```r
c(f8=AIC(f8), f8t=AIC(f8t), f=AIC(f), g=AIC(g), h=AIC(h))
```

```
      f8      f8t        f        g        h
257.6573 254.5172 255.8545 263.8413 254.5317
```
Based on AIC, the more traditional model fitted to the raw data and assuming linearity for all the continuous predictors has only a slight chance of producing worse cross-validated predictive accuracy than other methods. The chances are also good that effect estimates from this simple model will have competitive mean squared errors.
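The AIC comparison above trades fit against parameter count. A minimal sketch of the computation, with illustrative log-likelihoods of my own choosing (not taken from this dataset):

```python
# AIC = -2 * log-likelihood + 2 * (number of parameters); smaller is better.
def aic(loglik: float, n_params: int) -> float:
    return -2.0 * loglik + 2.0 * n_params

# Two hypothetical models: the second fits a bit better but spends
# many more parameters, so AIC prefers the first.
simple  = aic(loglik=-106.9, n_params=21)   # illustrative values
flexible = aic(loglik=-100.0, n_params=37)
print(simple < flexible)  # -> True
```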
11.5 Description of Fitted Model
Here we describe the simple all-linear full model. Summary statistics and Wald and likelihood ratio ANOVA tables are below, followed by partial effects plots with pointwise confidence bands, and odds ratios over default ranges of predictors.
```r
f
```

```
Logistic Regression Model

lrm(formula = cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +
        hg + ekg + pf + bm + hx + rx + dtime, data = psub, x = TRUE,
        y = TRUE)
```
The van Houwelingen–Le Cessie heuristic shrinkage estimate (Equation 4.1) is \(\hat{\gamma}=0.85\), indicating that this model will validate on new data about 15% worse than on this dataset.
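The heuristic shrinkage computation itself is simple. A sketch with illustrative numbers chosen to reproduce \(\hat{\gamma}\approx 0.85\) (the actual \(\chi^2\) and d.f. come from the fitted model, not from here):

```python
# van Houwelingen-Le Cessie heuristic shrinkage estimate:
#   gamma = (model LR chi^2 - d.f.) / (model LR chi^2)
def heuristic_shrinkage(lr_chi2: float, df: float) -> float:
    return (lr_chi2 - df) / lr_chi2

# An LR chi-square of 140 on 21 d.f. (illustrative numbers of the
# magnitude arising for this model) gives gamma = 0.85
print(round(heuristic_shrinkage(140.0, 21.0), 2))  # -> 0.85
```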
11.6 Backwards Step-Down
Now use fast backward step-down (with total residual AIC as the stopping rule) to identify the variables that explain the bulk of the cause of death. Later validation will take this screening of variables into account. The greatly reduced model results in a simple nomogram.
```r
fastbw(f)
fred <- lrm(cvd ~ sz + log(ap) + age + hx, data=psub)
nom  <- nomogram(fred, ap=c(.1, .5, 1, 5, 10, 50),
                 fun=plogis, funlabel="Probability",
                 fun.at=c(.01, .05, .1, .25, .5, .75, .9, .95, .99))
plot(nom, xfrac=.45)
```
It is readily seen from this model that patients with a history of heart disease, and patients with less extensive prostate cancer are those more likely to die from cvd rather than from cancer. But beware that it is easy to over-interpret findings when using unpenalized estimation, and confidence intervals are too narrow. Let us use the bootstrap to study the uncertainty in the selection of variables and to penalize for this uncertainty when estimating predictive performance of the model. The variables selected in the first 20 bootstrap resamples are shown, making it obvious that the set of “significant” variables, i.e., the final model, is somewhat arbitrary.
```r
v <- validate(f, B=200, bw=TRUE)
print(v, B=20, digits=3)
```
```
Index      Original  Training    Test  Optimism  Corrected  Successful
             Sample    Sample  Sample                Index   Resamples
Dxy           0.682     0.709   0.640     0.070      0.612         200
R2            0.439     0.476   0.388     0.087      0.352         200
Intercept     0.000     0.000  -0.004     0.004     -0.004         200
Slope         1.000     1.000   0.814     0.186      0.814         200
Emax          0.000     0.000   0.047     0.047      0.047         200
D             0.395     0.441   0.342     0.100      0.296         200
U            -0.008    -0.008   0.015    -0.023      0.015         200
Q             0.403     0.449   0.326     0.123      0.280         200
B             0.162     0.152   0.174    -0.022      0.184         200
g             1.932     2.174   1.734     0.440      1.491         200
gp            0.341     0.353   0.318     0.035      0.306         200
```
Factors Retained in Backwards Elimination
First 20 Resamples

[The original table of ⚬ marks, indicating which of the candidate factors sz, sg, ap, sbp, dbp, age, wt, hg, ekg, pf, bm, hx, rx, and dtime were retained in each of the first 20 bootstrap resamples, did not survive extraction; the selected subsets vary markedly from resample to resample.]
Frequencies of Numbers of Factors Retained

```
  1   2   3   4   5   6   7   8   9  10
  5  40  58  45  18  24   5   1   3   1
```
The slope shrinkage (\(\hat{\gamma}\)) is a bit lower than was estimated above. There is drop-off in all indexes. The estimated likely future predictive discrimination of the model as measured by Somers’ \(D_{xy}\) fell from 0.682 to 0.612. The latter estimate is the one that should be claimed when describing model performance.
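The optimism correction behind that table follows a simple recipe: refit on each bootstrap resample (training), evaluate that fit on the original data (test), average the training minus test differences, and subtract the average from the apparent index. A sketch using values close to the \(D_{xy}\) row (the helper name `optimism_corrected` is mine):

```python
# Bootstrap optimism correction as used by rms::validate:
# corrected = apparent - mean(training_i - test_i) over resamples
def optimism_corrected(apparent, train, test):
    optimism = sum(tr - te for tr, te in zip(train, test)) / len(train)
    return apparent - optimism

# With average training Dxy 0.709 and average test Dxy about 0.639,
# the corrected Dxy is 0.682 - 0.07 = 0.612, matching the table.
print(round(optimism_corrected(0.682, [0.709], [0.639]), 3))  # -> 0.612
```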
A nearly unbiased estimate of future calibration of the stepwise-derived model is given below.
```
Backwards Step-down - Original Model

 Deleted Chi-Sq d.f. P      Residual d.f. P      AIC
 ekg     6.76   5    0.2391  6.76    5    0.2391  -3.24
 bm      0.09   1    0.7639  6.85    6    0.3349  -5.15
 hg      0.38   1    0.5378  7.23    7    0.4053  -6.77
 sbp     0.48   1    0.4881  7.71    8    0.4622  -8.29
 wt      1.11   1    0.2932  8.82    9    0.4544  -9.18
 dtime   1.47   1    0.2253 10.29   10    0.4158  -9.71
 rx      5.65   3    0.1302 15.93   13    0.2528 -10.07
 pf      4.78   2    0.0915 20.71   15    0.1462  -9.29
 sg      4.28   1    0.0385 25.00   16    0.0698  -7.00
 dbp     5.84   1    0.0157 30.83   17    0.0209  -3.17

Approximate Estimates after Deleting Factors

               Coef    S.E.  Wald Z P
 Intercept -3.74986 1.82887  -2.050 0.0403286
 sz        -0.04862 0.01532  -3.174 0.0015013
 ap        -0.40694 0.11117  -3.660 0.0002518
 age        0.06000 0.02562   2.342 0.0191701
 hx         0.86969 0.34339   2.533 0.0113198

Factors in Final Model

[1] sz  ap  age hx
```
```r
cal <- calibrate(f, B=200, bw=TRUE)   # also prints the step-down summary
plot(cal)
```

```
n=257   Mean absolute error=0.028   Mean squared error=0.00106
0.9 Quantile of absolute error=0.056
```
The amount of overfitting seen in Figure 11.5 is consistent with the indexes produced by the validate function.
For comparison, consider a bootstrap validation of the full model without using variable selection.
```r
vfull <- validate(f, B=200)
print(vfull, digits=3)
```
```
Index      Original  Training    Test  Optimism  Corrected  Successful
             Sample    Sample  Sample                Index   Resamples
Dxy           0.786     0.837   0.738     0.099      0.687         200
R2            0.573     0.646   0.501     0.146      0.427         200
Intercept     0.000     0.000  -0.026     0.026     -0.026         200
Slope         1.000     1.000   0.678     0.322      0.678         200
Emax          0.000     0.000   0.091     0.091      0.091         200
D             0.558     0.661   0.467     0.193      0.364         200
U            -0.008    -0.008   0.057    -0.065      0.057         200
Q             0.566     0.669   0.411     0.258      0.308         200
B             0.133     0.113   0.150    -0.036      0.170         200
g             2.688     3.550   2.368     1.183      1.505         200
gp            0.394     0.418   0.365     0.052      0.342         200
```
Compared to the validation of the full model, the step-down model has less optimism, but it started with a smaller \(D_{xy}\) owing to the loss of information from removing moderately important variables. The improvement in optimism was not enough to offset the effect of eliminating variables. If shrinkage were used with the full model, it would have better calibration and discrimination than the reduced model, since shrinkage does not diminish \(D_{xy}\). Thus stepwise variable selection failed to deliver excellent predictive discrimination.
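The claim that shrinkage does not diminish \(D_{xy}\) holds because \(D_{xy}\) depends only on the ordering of predictions, and multiplying the linear predictor by a shrinkage factor in \((0,1)\) preserves that ordering. A small sketch with hypothetical predictions (ties ignored for brevity):

```python
# Somers' Dxy = (concordant - discordant) / (all event/non-event pairs).
# It is rank-based, so scaling predictions by gamma cannot change it.
def somers_dxy(pred, y):
    conc = disc = 0
    for pi, yi in zip(pred, y):
        for pj, yj in zip(pred, y):
            if yi == 1 and yj == 0:
                if pi > pj:
                    conc += 1
                elif pi < pj:
                    disc += 1
    return (conc - disc) / (conc + disc)   # no ties in this sketch

lp = [0.1, 0.4, 0.35, 0.8]          # hypothetical linear predictors
y  = [0,   0,   1,    1]            # 1 = cvd death
shrunk = [0.85 * x for x in lp]     # apply slope shrinkage
print(somers_dxy(lp, y) == somers_dxy(shrunk, y))  # -> True
```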
Finally, compare previous results with a bootstrap validation of a step-down model using a better significance level for a variable to stay in the model (\(\alpha=0.5\), Steyerberg et al. (2000)) and using individual approximate Wald tests rather than tests combining all deleted variables.
```r
v5 <- validate(f, bw=TRUE, sls=0.5, type='individual', B=200)
```

```
Backwards Step-down - Original Model

 Deleted Chi-Sq d.f. P      Residual d.f. P      AIC
 ekg     6.76   5    0.2391  6.76    5    0.2391  -3.24
 bm      0.09   1    0.7639  6.85    6    0.3349  -5.15
 hg      0.38   1    0.5378  7.23    7    0.4053  -6.77
 sbp     0.48   1    0.4881  7.71    8    0.4622  -8.29
 wt      1.11   1    0.2932  8.82    9    0.4544  -9.18
 dtime   1.47   1    0.2253 10.29   10    0.4158  -9.71
 rx      5.65   3    0.1302 15.93   13    0.2528 -10.07

Approximate Estimates after Deleting Factors

                             Coef    S.E.  Wald Z P
 Intercept               -4.86308 2.67292  -1.819 0.068852
 sz                      -0.05063 0.01581  -3.202 0.001366
 sg                      -0.28038 0.11014  -2.546 0.010903
 ap                      -0.24838 0.12369  -2.008 0.044629
 dbp                      0.28288 0.13036   2.170 0.030008
 age                      0.08502 0.02690   3.161 0.001572
 pf=in bed < 50% daytime  0.81151 0.66376   1.223 0.221485
 pf=in bed > 50% daytime -2.19885 1.21212  -1.814 0.069670
 hx                       0.87834 0.35203   2.495 0.012592

Factors in Final Model

[1] sz  sg  ap  dbp age pf  hx
```
```r
print(v5, digits=3, B=0)
```
```
Index      Original  Training    Test  Optimism  Corrected  Successful
             Sample    Sample  Sample                Index   Resamples
Dxy           0.739     0.808   0.719     0.089      0.650         200
R2            0.517     0.606   0.483     0.124      0.393         200
Intercept     0.000     0.000  -0.017     0.017     -0.017         200
Slope         1.000     1.000   0.731     0.269      0.731         200
Emax          0.000     0.000   0.073     0.073      0.073         200
D             0.486     0.605   0.446     0.159      0.327         200
U            -0.008    -0.008   0.039    -0.047      0.039         200
Q             0.494     0.613   0.407     0.206      0.288         200
B             0.147     0.123   0.155    -0.032      0.179         200
g             2.351     3.066   2.189     0.877      1.474         200
gp            0.372     0.403   0.359     0.044      0.328         200
```
The performance statistics are midway between the full model and the smaller stepwise model.
11.7 Model Approximation
Frequently a better approach than stepwise variable selection is to approximate the full model, using its estimates of precision, as discussed in Section 5.5. Stepwise variable selection as well as regression trees are useful for making the approximations, and the sacrifice in predictive accuracy is always apparent.
We begin by computing the “gold standard” linear predictor from the full model fit (\(R^{2} = 1.0\)), then running backwards step-down OLS regression to approximate it.
```r
spar(bty='l', ps=9)
lp <- predict(f)   # compute linear predictor from full model
# Insert sigma=1 as otherwise sigma=0 will cause problems
a <- ols(lp ~ sz + sg + log(ap) + sbp + dbp + age + wt + hg +
           ekg + pf + bm + hx + rx + dtime, sigma=1, data=psub)
# Specify silly stopping criterion to remove all variables
s <- fastbw(a, aics=10000)
betas <- s$Coefficients   # matrix, rows=iterations
X <- cbind(1, f$x)        # design matrix
# Compute the series of approximations to lp
ap <- X %*% t(betas)
# For each approximation compute R^2 and the ratio of the
# likelihood ratio chi-square for the approximate model to
# that of the original model
m <- ncol(ap) - 1   # all but the intercept-only model
r2 <- frac <- numeric(m)
fullchisq <- f$stats['Model L.R.']
for(i in 1:m) {
  lpa     <- ap[, i]
  r2[i]   <- cor(lpa, lp)^2
  fapprox <- lrm(cvd ~ lpa, data=psub)
  frac[i] <- fapprox$stats['Model L.R.'] / fullchisq
}
plot(r2, frac, type='b',
     xlab=expression(paste('Approximation ', R^2)),
     ylab=expression(paste('Fraction of ', chi^2, ' Preserved')))
abline(h=.95, col=gray(.83)); abline(v=.95, col=gray(.83))
abline(a=0, b=1, col=gray(.83))
```
After 6 deletions, slightly more than 0.05 of both the LR \(\chi^2\) and the approximation \(R^2\) are lost. Therefore we take as our approximate model the one that removed 6 predictors. The equation for this model is below, and its nomogram is in the figure below.
```r
fapprox <- ols(lp ~ sz + sg + log(ap) + age + ekg + pf + hx + rx, data=psub)
fapprox$stats['R2']   # as a check
nom <- nomogram(fapprox, ap=c(.1, .5, 1, 5, 10, 20, 30, 40),
                fun=plogis, funlabel="Probability",
                lp.at=(-5):4,
                fun.lp.at=qlogis(c(.01, .05, .25, .5, .75, .95, .99)))
plot(nom, xfrac=.45)
```
Byar, D. P., & Green, S. B. (1980). The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67, 477–488.
Kay, R. (1986). Treatment effects in competing-risks analysis of prostate cancer data. Biometrics, 42, 203–211.
Larson, M. G., & Dinse, G. E. (1985). A mixture model for the regression analysis of competing risks data. Appl Stat, 34, 201–211.
Steyerberg, E. W., Eijkemans, M. J. C., Harrell, F. E., & Habbema, J. D. F. (2000). Prognostic modelling with logistic regression analysis: A comparison of selection and estimation methods in small data sets. Stat Med, 19, 1059–1079.