Recall that the aim of data reduction is to reduce (without using the outcome) the number of parameters needed in the outcome model. The following case study illustrates these techniques:
redundancy analysis;
variable clustering;
data reduction using principal component analysis (PCA), sparse PCA, and pre-transformations;
restricted cubic spline fitting using ordinary least squares, in the context of scaling; and
scaling/variable transformations using canonical variates and nonparametric additive regression.
8.1 Data
Consider the 506-patient prostate cancer dataset from Byar & Green (1980). The data are listed in [Andrews & Herzberg (1985); Table 46] and are available at https://hbiostat.org/data. These data were from a randomized trial comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration. The ultimate goal of an analysis of the dataset might be to discover patterns in survival or to do an analysis of covariance to assess the effect of treatment while adjusting for patient heterogeneity. See Chapter 21 for such analyses. The data reductions developed here are general and can be used for a variety of dependent variables.
The variable names, labels, and a summary of the data are printed below. Because of extreme skewness, ap is log-transformed for the purpose of making its spike histogram.
Code
```r
require(Hmisc)
getHdata(prostate)    # Download and make prostate accessible
# Convert an old date format to R format
prostate$sdate <- as.Date(prostate$sdate)
d <- describe(prostate[2:17], trans=list(ap=list('log', log, exp)))
sparkline::sparkline(0)   # load sparkline javascript dependencies
```
prostate[2:17] Descriptives

9 Continuous Variables of 16 Variables, 502 Observations

| Variable | Label | n | Missing | Distinct | Info | Mean | pMedian | Gini \|Δ\| | Quantiles .05 .10 .25 .50 .75 .90 .95 |
|---|---|---|---|---|---|---|---|---|---|
| dtime | Months of Follow-up | 502 | 0 | 76 | 1.000 | 36.13 | 36 | 26.89 | 1.05 5.00 14.25 34.00 57.75 67.00 71.00 |
| age | Age in Years | 501 | 1 | 41 | 0.996 | 71.46 | 72.5 | 7.497 | 56 60 70 73 76 78 80 |
| wt | Weight Index = wt(kg)-ht(cm)+200 | 500 | 2 | 67 | 0.999 | 99.03 | 98.5 | 14.93 | 77.95 82.90 90.00 98.00 107.00 116.00 123.00 |
| sbp | Systolic Blood Pressure/10 | 502 | 0 | 18 | 0.980 | 14.35 | 14 | 2.596 | 11 12 13 14 16 17 18 |
| dbp | Diastolic Blood Pressure/10 | 502 | 0 | 12 | 0.945 | 8.149 | 8 | 1.553 | 6 6 7 8 9 10 10 |
| hg | Serum Hemoglobin (g/100ml) | 502 | 0 | 91 | 1.000 | 13.45 | 13.55 | 2.16 | 10.2 10.7 12.3 13.7 14.7 15.8 16.4 |
| sz | Size of Primary Tumor (cm^2) | 497 | 5 | 55 | 0.998 | 14.63 | 13 | 13.05 | 2.0 3.0 5.0 11.0 21.0 32.0 39.2 |
| sg | Combined Index of Stage and Hist. Grade | 491 | 11 | 11 | 0.959 | 10.31 | 10 | 2.245 | 8 8 9 10 11 13 13 |
| ap | Serum Prostatic Acid Phosphatase | 502 | 0 | 128 | 0.996 | 12.18 | 1.35 | 21.71 | 0.300 0.300 0.500 0.700 2.975 21.689 38.470 |

7 Categorical Variables of 16 Variables, 502 Observations

| Variable | Label | n | Missing | Distinct | Info | Sum | Mean |
|---|---|---|---|---|---|---|---|
| stage | Stage | 502 | 0 | 2 | 0.733 | | 3.424 |
| rx | | 502 | 0 | 4 | | | |
| status | | 502 | 0 | 10 | | | |
| pf | | 502 | 0 | 4 | | | |
| hx | History of Cardiovascular Disease | 502 | 0 | 2 | 0.733 | 213 | 0.4243 |
| ekg | | 494 | 8 | 7 | | | |
| bm | Bone Metastases | 502 | 0 | 2 | 0.410 | 82 | 0.1633 |
stage is defined by ap as well as X-ray results. Of the patients in stage 3, 0.92 have ap \(\leq\) 0.8. Of those in stage 4, 0.93 have ap > 0.8. Since stage can be predicted almost certainly from ap, we do not consider stage in some of the analyses.
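A minimal check of this relationship, assuming the prostate data frame has been loaded as above:

```r
# Row proportions: within each stage, the fraction of patients with ap <= 0.8
# (stage 3 row, TRUE column ~ 0.92; stage 4 row, FALSE column ~ 0.93)
with(prostate, prop.table(table(stage, ap <= 0.8), margin=1))
```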
8.2 How Many Parameters Can Be Estimated?
There are 354 deaths among the 502 patients. If predicting survival time were of major interest, we could develop a reliable model if no more than about \(354/15 = 24\) parameters were examined against \(Y\) in unpenalized modeling. Suppose that a full model with no interactions is fitted and that linearity is not assumed for any continuous predictors. Assuming age is almost linear, we could fit a restricted cubic spline function with three knots. For the other continuous variables, let us use five knots. For categorical predictors, the maximum number of degrees of freedom needed would be one fewer than the number of categories. For pf we could lump the last two categories since the last category has only 2 patients. Likewise, we could combine the last two levels of ekg. Table 8.1 lists candidate predictors with the maximum number of parameters we consider for each.
Table 8.1: Degrees of freedom needed for predictors

| Predictor: | rx | age | wt | pf | hx | sbp | dbp | ekg | hg | sz | sg | ap | bm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of Parameters: | 3 | 2 | 4 | 2 | 1 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | 1 |
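A quick arithmetic check of the 15:1 events-per-parameter guideline and of the total parameter count implied by Table 8.1; a minimal sketch using only the numbers quoted above:

```r
# Roughly 15 events per candidate parameter: 354 deaths allow about 24 parameters
round(354 / 15)

# Total candidate d.f. from Table 8.1
df <- c(rx=3, age=2, wt=4, pf=2, hx=1, sbp=4, dbp=4, ekg=5,
        hg=4, sz=4, sg=4, ap=4, bm=1)
sum(df)   # 42 parameters, well beyond the 24 suggested by the 15:1 rule
```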
8.3 Redundancy Analysis
As described in Section 4.7.1, it is occasionally useful to do a rigorous redundancy analysis on a set of potential predictors. Let us run the algorithm discussed there, on the set of predictors we are considering. We will use a low threshold (0.3) for \(R^2\) for demonstration purposes.
Code
```r
# Allow only 1 d.f. for three of the predictors
prostate <- transform(prostate,
                      ekg.norm = 1 * (ekg %in% c("normal", "benign")),
                      rxn = as.numeric(rx),
                      pfn = as.numeric(pf))
# Force pfn, rxn to be linear because of difficulty of placing
# knots with so many ties in the data
# Note: all incomplete cases are deleted (inefficient)
r <- redun(~ stage + I(rxn) + age + wt + I(pfn) + hx + sbp + dbp +
             ekg.norm + hg + sz + sg + ap + bm,
           r2=.3, type='adjusted', data=prostate)
r
```
```
Redundancy Analysis

~stage + I(rxn) + age + wt + I(pfn) + hx + sbp + dbp + ekg.norm +
    hg + sz + sg + ap + bm

n: 483   p: 14   nk: 3

Number of NAs: 0

Transformation of target variables forced to be linear

R-squared cutoff: 0.3   Type: adjusted

R^2 with which each variable can be predicted from all other variables:

   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp
   0.658    0.000    0.073    0.111    0.156    0.062    0.452    0.417
ekg.norm       hg       sz       sg       ap       bm
   0.055    0.146    0.192    0.540    0.147    0.391

Redundant variables:

stage sbp bm sg

Predicted from variables:

I(rxn) age wt I(pfn) hx dbp ekg.norm hg sz ap

  Variable Deleted   R^2   R^2 after later deletions
1            stage 0.658           0.658 0.646 0.494
2              sbp 0.452                 0.453 0.455
3               bm 0.374                       0.367
4               sg 0.342
```
Code

```r
r2describe(r$scores, nvmax=4)   # show strongest predictors of each variable
```
By any reasonable criterion on \(R^2\), none of the predictors is redundant. stage can be predicted with an \(R^{2} = 0.658\) from the other 13 variables, but only with \(R^{2} = 0.494\) after deletion of 3 variables later declared to be “redundant.”
8.4 Variable Clustering
From Table 8.1, the total number of parameters is 42, so some data reduction should be considered. We resist the temptation to take the “easy way out” using stepwise variable selection so that we can achieve a more stable modeling process 1 and obtain unbiased standard errors. Before using a variable clustering procedure, note that ap is extremely skewed. To handle skewness, we use Spearman rank correlations for continuous variables (later we transform each variable using transcan, which will allow ordinary correlation coefficients to be used). After classifying ekg as “normal/benign” versus everything else, the Spearman correlations are plotted below.
1Sauerbrei & Schumacher (1992) used the bootstrap to demonstrate the variability of a standard variable selection procedure for the prostate cancer dataset.
Code
```r
x <- with(prostate,
          cbind(stage, rx, age, wt, pf, hx, sbp, dbp,
                ekg.norm, hg, sz, sg, ap, bm))
# If no missing data, could use cor(apply(x, 2, rank))
r <- rcorr(x, type="spearman")$r   # rcorr in Hmisc
maxabsr <- max(abs(r[row(r) != col(r)]))
```
Code

```r
plotCorrM(r)[[1]]   # An Hmisc function
```
We perform a hierarchical cluster analysis based on a similarity matrix that contains pairwise Hoeffding \(D\) statistics (Hoeffding, 1948). \(D\) will detect nonmonotonic associations.
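The clustering dendrogram is produced with the Hmisc varclus function, using Hoeffding's \(D\) as the similarity measure:

```r
vc <- varclus(~ stage + rxn + age + wt + pfn + hx + sbp + dbp +
                ekg.norm + hg + sz + sg + ap + bm,
              sim='hoeffding', data=prostate)
plot(vc)
```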
We combine sbp and dbp, and tentatively combine ap, sg, sz, and bm.
8.5 Transformation and Single Imputation Using transcan
Now we turn to the scoring of the predictors to potentially reduce the number of regression parameters that are needed later by doing away with the need for nonlinear terms and multiple dummy variables. The R Hmisc package transcan function defaults to using a maximum generalized variance method (Kuhfeld, 2009) that incorporates canonical variates to optimally transform both sides of a multiple regression model. Each predictor is treated in turn as a variable being predicted, and all variables are expanded into restricted cubic splines (for continuous variables) or dummy variables (for categorical ones).
Code
```r
spar(bot=1)
# Combine 2 levels of ekg (one had freq. 1)
levels(prostate$ekg)[levels(prostate$ekg) %in%
                     c('old MI', 'recent MI')] <- 'MI'
prostate$pf.coded <- as.integer(prostate$pf)
# make a numeric version; combine last 2 levels of original
levels(prostate$pf) <- levels(prostate$pf)[c(1, 2, 3, 3)]
ptrans <-
  transcan(~ sz + sg + ap + sbp + dbp + age + wt + hg + ekg + pf + bm + hx,
           imputed=TRUE, transformed=TRUE, trantab=TRUE, pl=FALSE,
           show.na=TRUE, data=prostate, frac=.1, pr=FALSE)
summary(ptrans, digits=4)
```
Note that at face value the transformation of ap was derived in a circular manner, since the combined index of stage and histologic grade, sg, uses in its stage component a cutoff on ap. However, if sg is omitted from consideration, the resulting transformation for ap does not change appreciably. Note that bm and hx are represented as binary variables, so their coefficients in the table of canonical variable coefficients are on a different scale. For the variables that were actually transformed, the coefficients are for standardized transformed variables (mean 0, variance 1). From examining the \(R^2\)s, age, wt, ekg, pf, and hx are not strongly related to other variables. Imputations for age, wt, ekg are thus relying more on the median or modal values from the marginal distributions. From the coefficients of first (standardized) canonical variates, sbp is predicted almost solely from dbp; bm is predicted mainly from ap, hg, and pf. 2
2Schemper & Heinze (1997) used logistic models to impute dichotomizations of the predictors for this dataset.
8.6 Data Reduction Using Principal Components
The first PC, PC\(_{1}\), is the linear combination of standardized variables having maximum variance. PC\(_{2}\) is the linear combination of predictors having the second largest variance such that PC\(_{2}\) is orthogonal to (uncorrelated with) PC\(_{1}\). If there are \(p\) raw variables, the first \(k\) PCs, where \(k < p\), will explain only part of the variation in the whole system of \(p\) variables unless one or more of the original variables is exactly a linear combination of the remaining variables. Note that it is common to scale and center variables to have mean zero and variance 1 before computing PCs.
The response variable (here, time until death due to any cause) is not examined during data reduction, so that if PCs are selected by variance explained in the \(X\)-space and not by variation explained in \(Y\), one needn’t correct for model uncertainty or multiple comparisons.
PCA results in data reduction when the analyst uses only a subset of the \(p\) possible PCs in predicting \(Y\). This is called incomplete principal component regression. When one sequentially enters PCs into a predictive model in a strict pre-specified order (i.e., by descending amounts of variance explained for the system of \(p\) variables), model uncertainty requiring bootstrap adjustment is minimized. In contrast, model uncertainty associated with stepwise regression (driven by associations with \(Y\)) is massive.
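The idea can be sketched generically in base R on simulated data; this is illustrative only (the prostate analysis below uses princmp and cph instead), and the simulated variables carry no subject-matter meaning:

```r
# Minimal sketch of incomplete principal component regression
set.seed(1)
X <- matrix(rnorm(100 * 6), ncol=6)            # 6 standardized predictors
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)        # outcome used only at the modeling stage
pcs <- prcomp(X, center=TRUE, scale.=TRUE)$x   # PCs ordered by variance explained

# Enter PCs in a pre-specified order (1, then 1-2, ...) and compare fits by AIC
sapply(1:6, function(k) AIC(lm(y ~ pcs[, 1:k, drop=FALSE])))
```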
For the prostate dataset, consider PCs on raw candidate predictors, expanding polytomous factors using indicator variables. The R function princmp is used, after singly imputing missing raw values using transcan’s optimal additive nonlinear models. In this series of analyses we ignore the treatment variable, rx. princmp calls the built-in princomp function. princmp assists with interpretation of PCs.
Code
```r
spar(top=1, ps=7)
# Impute all missing values in all variables given to transcan
imputed <- impute(ptrans, data=prostate, list.out=TRUE)
```
```
Imputed missing values with the following frequencies
and stored them in variables with their original names:

 sz  sg age  wt ekg
  5  11   1   2   8
```
Code

```r
imputed <- as.data.frame(imputed)
# Compute principal components on imputed data.  princmp will expand
# the categorical variable EKG into indicator variables.
# Use correlation matrix
pfn <- prostate$pfn
prin.raw <- princmp(~ sz + sg + ap + sbp + dbp + age + wt + hg +
                      ekg + pfn + bm + hx,
                    k=16, sw=TRUE, data=imputed)
prin.raw
```
```
Principal Components Analysis

Stepwise Approximations to PCs With Cumulative R^2

PC 1
  bm (0.486) + dbp (0.643) + sg (0.743) + hg (0.814) + ap (0.871)
PC 2
  sbp (0.514) + ekgheart strain (0.648) + sg (0.754) + dbp (0.836) + pfn (0.888)
PC 3
  ekgMI (0.376) + hx (0.63) + age (0.795) + pfn (0.894) + hg (0.942)
PC 4
  ekgheart strain (0.453) + ekgrhythmic disturb & electrolyte ch (0.631) +
  sbp (0.754) + sz (0.809) + ekgheart block or conduction def (0.853)
PC 5
  age (0.198) + ekgbenign (0.37) + pfn (0.516) + ekgMI (0.624) + wt (0.718)
```
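The scree plot discussed next is built by plotting the components of the raw variables and then overlaying, in red, components computed on the transcan-transformed variables:

```r
plot(prin.raw, ylim=c(0, 4), offset=1)
prin.trans <- princmp(ptrans$transformed, k=12)
prin.trans
plot(prin.trans, add=TRUE, col='red', offset=-1.3)
```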
The plot shown in Figure 8.4 is called a “scree” plot [Jolliffe (2010); pp. 96–99, 104, 106]. It shows the variation explained by the first \(k\) principal components as \(k\) increases all the way to 16 parameters (no data reduction). It requires 10 of the 16 possible components to explain \(> 0.8\) of the variance, and the first 5 components explain \(0.49\) of the variance of the system. Two of the 16 dimensions are almost totally redundant.
Code

```r
plot(prin.raw, 'loadings', k=6)
```
Code

```r
plot(prin.trans, 'loadings', k=6)
```
After repeating this process when transforming all predictors via transcan, we have only 12 degrees of freedom for the 12 predictors. The variance explained is depicted in Figure 8.4 in red. It requires at least 8 of the 12 possible components to explain \(\geq 0.8\) of the variance, and the first 5 components explain \(0.66\) of the variance as opposed to \(0.49\) for untransformed variables.
Let us see how the PCs “explain” the times until death using the Cox regression (Cox, 1972) function from rms, cph, described in Chapter 20. In what follows we vary the number of components used in the Cox models from 1 to all 16, computing the AIC for each model. AIC is related to model log likelihood penalized for number of parameters estimated, and lower is better. For reference, the AIC of the model using all of the original predictors, and the AIC of a full additive spline model are shown as horizontal lines.
Code
```r
require(rms)
spar(bty='l')
S <- with(prostate, Surv(dtime, status != "alive"))   # two-column response var.
pcs <- prin.raw$scores    # pick off all PCs
aic <- numeric(16)
for(i in 1:16) {
  ps     <- pcs[, 1:i]
  aic[i] <- AIC(cph(S ~ ps))
}
plot(1:16, aic, xlab='Number of Components Used',
     ylab='AIC', type='l', ylim=c(3950, 4000))
f <- cph(S ~ sz + sg + log(ap) + sbp + dbp + age + wt + hg +
           ekg + pf + bm + hx, data=imputed)
abline(h=AIC(f), col='blue')
## The following model in the 2nd edition no longer converges
# f <- cph(S ~ rcs(sz,5) + rcs(sg,5) + rcs(log(ap),5) +
#            rcs(sbp,5) + rcs(dbp,5) + rcs(age,3) + rcs(wt,5) +
#            rcs(hg,5) + ekg + pf + bm + hx,
#          tol=1e-14, data=imputed)
f <- cph(S ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),5) +
           rcs(sbp,4) + rcs(dbp,4) + rcs(age,3) + rcs(wt,4) +
           rcs(hg,4) + ekg + pf + bm + hx,
         tol=1e-14, data=imputed)
abline(h=AIC(f), col='blue', lty=2)
```
For the money, the first 5 components adequately summarize all variables, if linearly transformed, and the full linear model is no better than this. The model allowing all continuous predictors to be nonlinear is not worth its added degrees of freedom.
Next check the performance of a model derived from cluster scores of transformed variables.
Code
```r
# Compute PC1 on a subset of transcan-transformed predictors
pco <- function(v) {
  f <- princmp(ptrans$transformed[, v])
  print(f)
  vars <- attr(f, 'results')$vars
  cat('Fraction of variance explained by PC1:',
      round(vars[1] / sum(vars), 2), '\n')
  f$scores[, 1]
}
tumor <- pco(c('sz', 'sg', 'ap', 'bm'))
```
```
Principal Components Analysis
Fraction of variance explained by PC1:
```

Code

```r
bp <- pco(c('sbp', 'dbp'))
```

```
Principal Components Analysis
Fraction of variance explained by PC1:
```

Code

```r
cardiac <- pco(c('hx', 'ekg'))
```

```
Principal Components Analysis
Fraction of variance explained by PC1:
```
Code

```r
# Get transformed individual variables that are not clustered
other <- ptrans$transformed[, c('hg', 'age', 'pf', 'wt')]
f <- cph(S ~ tumor + bp + cardiac + other)   # other is matrix
AIC(f)
```
[1] 3954.393
Code

```r
print(f, long=FALSE, title='')
```
| | Model Tests | Discrimination Indexes |
|---|---|---|
| Obs 502 | LR χ2 81.11 | R2 0.149 |
| Events 354 | d.f. 7 | R2(7,502) 0.137 |
| Center 0 | Pr(>χ2) 0.0000 | R2(7,354) 0.189 |
| | Score χ2 86.81 | Dxy 0.286 |
| | Pr(>χ2) 0.0000 | |

| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---|---|---|---|
| tumor | -0.1723 | 0.0367 | -4.69 | <0.0001 |
| bp | 0.0251 | 0.0424 | 0.59 | 0.5528 |
| cardiac | 0.2513 | 0.0516 | 4.87 | <0.0001 |
| hg | -0.1407 | 0.0554 | -2.54 | 0.0111 |
| age | -0.1034 | 0.0579 | -1.79 | 0.0739 |
| pf | -0.0933 | 0.0487 | -1.92 | 0.0551 |
| wt | -0.0910 | 0.0555 | -1.64 | 0.1012 |
The tumor and cardiac clusters seem to dominate prediction of mortality, and the AIC of the model built from cluster scores of transformed variables compares favorably with other models (Figure 8.5).
8.6.1 Sparse Principal Components
A disadvantage of principal components is that every predictor receives a nonzero weight for every component, so many coefficients are involved even though the effective degrees of freedom with respect to the response model are reduced. Sparse principal components (Witten & Tibshirani, 2008) uses a penalty function to reduce the magnitude of the loadings variables receive in the components. If an L1 penalty is used (as with the lasso), some loadings are shrunk to zero, resulting in some simplicity. Sparse principal components combines some elements of variable clustering, scoring of variables within clusters, and redundancy analysis.
Filzmoser et al. (2012) have written a nice R package pcaPP for doing sparse PC analysis. 3 The following example uses the prostate data again. To allow for nonlinear transformations and to score the ekg variable in the prostate dataset down to a scalar, we use the transcan-transformed predictors as inputs.
3 The spcr package is another sparse PC package that should also be considered.
Code
```r
s <- princmp(ptrans$transformed, k=10, method='sparse',
             sw=TRUE, nvmax=3)
s
```
```
Sparse Principal Components Analysis

Stepwise Approximations to PCs With Cumulative R^2

PC 1
  sg (0.83) + ap (0.962) + sz (0.992)
PC 2
  sbp (0.836) + dbp (1)
PC 3
  hg (1)
PC 4
  hx (1)
PC 5
  wt (1)
```
Code

```r
plot(s)
```
Code

```r
plot(s, 'loadings', nrow=1)
```
Only nonzero loadings are shown. The first sparse PC is the tumor cluster used above, and the second is the blood pressure cluster. Let us see how well incomplete sparse principal component regression predicts time until death.
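The AIC comparison parallels the one for ordinary PCs, entering 1 to 10 sparse components in their pre-specified order:

```r
spar(bty='l')
pcs <- s$scores        # pick off sparse PCs
aic <- numeric(10)
for(i in 1:10) {
  ps     <- pcs[, 1:i]
  aic[i] <- AIC(cph(S ~ ps))
}
plot(1:10, aic, xlab='Number of Components Used',
     ylab='AIC', type='l', ylim=c(3950, 4000))
```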
More components are required to optimize AIC than were seen in Figure 8.5, but a model built from 6–8 sparse PCs performed as well as the other models.
8.7 Transformation Using Nonparametric Smoothers
The ACE nonparametric additive regression method of Breiman & Friedman (1985) transforms both the left-hand-side variable and all the right-hand-side variables so as to optimize \(R^{2}\). ACE can be used to transform the predictors using the R ace function in the acepack package, called by the transace function in the Hmisc package. transace does not impute data but merely does casewise deletion of missing values. Here transace is run after single imputation by transcan. binary is used to tell transace which variables not to try to predict (because they need no transformation). Several predictors are restricted to be monotonically transformed.
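The call below runs transace on the singly imputed data, with monotonicity restrictions on sz, sg, ap, sbp, dbp, age, and pf; printing the fit produces the summary that follows:

```r
f <- transace(~ monotone(sz) + monotone(sg) + monotone(ap) +
                monotone(sbp) + monotone(dbp) + monotone(age) +
                monotone(pf) + wt + hg + ekg + bm + hx,
              data=imputed)
f
```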
```
Transformations Using Alternating Conditional Expectation

~monotone(sz) + monotone(sg) + monotone(ap) + monotone(sbp) +
    monotone(dbp) + monotone(age) + monotone(pf) + wt + hg +
    ekg + bm + hx

n= 502

Transformations:

      sz       sg       ap      sbp      dbp      age
monotone monotone monotone monotone monotone monotone
      pf       wt       hg         ekg     bm     hx
monotone  general  general categorical linear linear

R-squared achieved in predicting each variable:

   sz    sg    ap   sbp   dbp   age    pf    wt    hg   ekg    bm    hx
0.228 0.575 0.573 0.483 0.458 0.152 0.178 0.172 0.202 0.111    NA    NA
```
Code

```r
ggplot(f)
```
Except for ekg and age, and allowing for arbitrary sign reversals, the transformations in Figure 8.8 determined using transace were similar to those in Figure 8.3. The transcan transformation for ekg makes more sense.
8.8 Study Questions
Section 8.2
Critique the choice of the number of parameters devoted to each predictor.
Section 8.3
Explain why the final amount of redundancy can change from the initial assessment.
Section 8.5
What is the meaning of the first canonical variate?
Explain the first row of numbers in the matrix of coefficients of canonical variates.
The transformations in Figure 8.3 are not optimal for predicting the outcome. What good can be said of them?
Section 8.6
Why in general terms do principal components work well in predictive modeling?
What is the advantage of sparse PCs over regular PCs?
Andrews, D. F., & Herzberg, A. M. (1985). Data. Springer-Verlag.
Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J Am Stat Assoc, 80, 580–619.
Byar, D. P., & Green, S. B. (1980). The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67, 477–488.
Cox, D. R. (1972). Regression models and life-tables (with discussion). J Roy Stat Soc B, 34, 187–220.
Sauerbrei, W., & Schumacher, M. (1992). A bootstrap resampling procedure for model building: Application to the Cox regression model. Stat Med, 11, 2093–2109.
Schemper, M., & Heinze, G. (1997). Probability imputation revisited for prognostic factor studies. Stat Med, 16, 73–80.
Witten, D. M., & Tibshirani, R. (2008). Testing significance of features by lassoed principal components. Ann Appl Stat, 2(3), 986–1012.