# 8  Case Study in Data Reduction

Recall that the aim of data reduction is to reduce (without using the outcome) the number of parameters needed in the outcome model. The following case study illustrates these techniques:

1. redundancy analysis;
2. variable clustering;
3. data reduction using principal component analysis (PCA), sparse PCA, and pre-transformations;
4. restricted cubic spline fitting using ordinary least squares, in the context of scaling; and
5. scaling/variable transformations using canonical variates and nonparametric additive regression.

## 8.1 Data

Consider the 506-patient prostate cancer dataset from Byar & Green (1980). The data are listed in [Andrews & Herzberg (1985); Table 46] and are available at https://hbiostat.org/data. These data were from a randomized trial comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration. The ultimate goal of an analysis of the dataset might be to discover patterns in survival or to do an analysis of covariance to assess the effect of treatment while adjusting for patient heterogeneity. See Chapter 21 for such analyses. The data reductions developed here are general and can be used for a variety of dependent variables.

The variable names, labels, and a summary of the data are printed below.

Code
require(Hmisc)
getHdata(prostate)  # Download and make prostate accessible
# Convert an old date format to R format
prostate$sdate <- as.Date(prostate$sdate)
d <- describe(prostate[2:17])
html(d)
prostate[2:17] Descriptives

16 Variables   502 Observations

stage: Stage
        n  missing  distinct   Info   Mean     Gmd
      502        0         2  0.733  3.424  0.4895
 Value          3     4
Frequency    289   213
Proportion 0.576 0.424


rx
        n  missing  distinct
      502        0         4
 Value              placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen
Frequency              127             124             126             125
Proportion           0.253           0.247           0.251           0.249


dtime: Months of Follow-up
        n  missing  distinct  Info   Mean    Gmd
      502        0        76     1  36.13  26.89
      .05    .10    .25    .50    .75    .90    .95
     1.05   5.00  14.25  34.00  57.75  67.00  71.00
lowest : 0 1 2 3 4 , highest: 72 73 74 75 76
status
        n  missing  distinct
      502        0        10

lowest : alive, dead - prostatic ca, dead - heart or vascular,
         dead - cerebrovascular, dead - pulmonary embolus
highest: dead - other ca, dead - respiratory disease,
         dead - other specific non-ca, dead - unspecified non-ca,
         dead - unknown cause

age: Age in Years
        n  missing  distinct   Info   Mean    Gmd
      501        1        41  0.996  71.46  7.497
      .05  .10  .25  .50  .75  .90  .95
       56   60   70   73   76   78   80
lowest : 48 49 50 51 52 , highest: 84 85 87 88 89
wt: Weight Index = wt(kg)-ht(cm)+200
        n  missing  distinct   Info   Mean    Gmd
      500        2        67  0.999  99.03  14.93
      .05     .10     .25     .50     .75     .90     .95
    77.95   82.90   90.00   98.00  107.00  116.00  123.00
lowest : 69 71 72 73 74 , highest: 136 142 145 150 152
pf
        n  missing  distinct
      502        0         4
 Value           normal activity in bed < 50% daytime in bed > 50% daytime
Frequency                   450                   37                   13
Proportion                0.896                0.074                0.026

Value           confined to bed
Frequency                     2
Proportion                0.004


hx: History of Cardiovascular Disease
        n  missing  distinct   Info  Sum    Mean     Gmd
      502        0         2  0.733  213  0.4243  0.4895

sbp: Systolic Blood Pressure/10
        n  missing  distinct  Info   Mean    Gmd
      502        0        18  0.98  14.35  2.596
      .05  .10  .25  .50  .75  .90  .95
       11   12   13   14   16   17   18
lowest : 8 9 10 11 12 , highest: 21 22 23 24 30
 Value          8     9    10    11    12    13    14    15    16    17    18    19
Frequency      1     3    14    27    65    74    98    74    72    34    17    12
Proportion 0.002 0.006 0.028 0.054 0.129 0.147 0.195 0.147 0.143 0.068 0.034 0.024

Value         20    21    22    23    24    30
Frequency      3     2     3     1     1     1
Proportion 0.006 0.004 0.006 0.002 0.002 0.002


dbp: Diastolic Blood Pressure/10
        n  missing  distinct   Info   Mean    Gmd
      502        0        12  0.945  8.149  1.553
      .05  .10  .25  .50  .75  .90  .95
        6    6    7    8    9   10   10
lowest : 4 5 6 7 8 , highest: 11 12 13 14 18
 Value          4     5     6     7     8     9    10    11    12    13    14    18
Frequency      4     5    43   107   165    94    66     9     5     2     1     1
Proportion 0.008 0.010 0.086 0.213 0.329 0.187 0.131 0.018 0.010 0.004 0.002 0.002


ekg
        n  missing  distinct
      494        8         7

lowest : normal, benign, rhythmic disturb & electrolyte ch,
         heart block or conduction def, heart strain
highest: rhythmic disturb & electrolyte ch, heart block or conduction def,
         heart strain, old MI, recent MI

hg: Serum Hemoglobin (g/100ml)
        n  missing  distinct  Info   Mean   Gmd
      502        0        91     1  13.45  2.16
      .05   .10   .25   .50   .75   .90   .95
     10.2  10.7  12.3  13.7  14.7  15.8  16.4

lowest : 5.89941  7  7.19922  7.79981  8.19922
highest: 17.2969  17.5  17.5977  18.1992  21.1992

sz: Size of Primary Tumor (cm2)
        n  missing  distinct   Info   Mean    Gmd
      497        5        55  0.998  14.63  13.05
      .05   .10   .25   .50   .75   .90   .95
      2.0   3.0   5.0  11.0  21.0  32.0  39.2
lowest : 0 1 2 3 4 , highest: 54 55 61 62 69
sg: Combined Index of Stage and Hist. Grade
        n  missing  distinct   Info   Mean    Gmd
      491       11        11  0.959  10.31  2.245
      .05  .10  .25  .50  .75  .90  .95
        8    8    9   10   11   13   13
lowest : 5 6 7 8 9 , highest: 11 12 13 14 15
 Value          5     6     7     8     9    10    11    12    13    14    15
Frequency      3     8     7    67   137    33   114    26    75     5    16
Proportion 0.006 0.016 0.014 0.136 0.279 0.067 0.232 0.053 0.153 0.010 0.033


ap: Serum Prostatic Acid Phosphatase
        n  missing  distinct   Info   Mean    Gmd
      502        0       128  0.996  12.18  21.71
       .05     .10     .25     .50     .75     .90     .95
     0.300   0.300   0.500   0.700   2.975  21.689  38.470

lowest : 0.0999908  0.199982  0.299988  0.399963  0.5
highest: 316  353.5  367  596  999.875
 Value          0    10    20    30    40    50    60    80   100   130   150   160
Frequency    406    38    13    14    13     3     2     1     1     1     1     1
Proportion 0.809 0.076 0.026 0.028 0.026 0.006 0.004 0.002 0.002 0.002 0.002 0.002

Value        180   230   280   320   350   370   600  1000
Frequency      1     1     1     1     1     1     1     1
Proportion 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002

For the frequency table, variable is rounded to the nearest 10
bm: Bone Metastases
        n  missing  distinct   Info  Sum    Mean     Gmd
      502        0         2  0.418   82  0.1633  0.2739

stage is defined by ap as well as X-ray results. Of the patients in stage 3, 0.92 have ap $$\leq$$ 0.8. Of those in stage 4, 0.93 have ap > 0.8. Since stage can be predicted almost certainly from ap, we do not consider stage in some of the analyses.

## 8.2 How Many Parameters Can Be Estimated?

There are 354 deaths among the 502 patients. If predicting survival time were of major interest, we could develop a reliable model if no more than about $$354/15 = 24$$ parameters were examined against $$Y$$ in unpenalized modeling. Suppose that a full model with no interactions is fitted and that linearity is not assumed for any continuous predictors. Assuming age is almost linear, we could fit a restricted cubic spline function with three knots. For the other continuous variables, let us use five knots. For categorical predictors, the maximum number of degrees of freedom needed would be one fewer than the number of categories. For pf we could lump the last two categories since the last category has only 2 patients. Likewise, we could combine the last two levels of ekg. Table 8.1 lists candidate predictors with the maximum number of parameters we consider for each.

Table 8.1: Degrees of freedom needed for predictors

| Predictor            | rx | age | wt | pf | hx | sbp | dbp | ekg | hg | sz | sg | ap | bm |
|----------------------|----|-----|----|----|----|-----|-----|-----|----|----|----|----|----|
| Number of Parameters |  3 |   2 |  4 |  2 |  1 |   4 |   4 |   5 |  4 |  4 |  4 |  4 |  1 |
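The budget arithmetic can be verified directly. A quick sketch (in Python rather than the chapter's R, with the counts taken from Table 8.1):

```python
# Events-per-parameter budget: with 354 deaths and the 15:1 rule of
# thumb, about 354/15 = 24 parameters can be examined against Y.
events = 354
budget = events / 15          # approximately 23.6, i.e., about 24

# Maximum d.f. per candidate predictor, from Table 8.1
df = {'rx': 3, 'age': 2, 'wt': 4, 'pf': 2, 'hx': 1, 'sbp': 4, 'dbp': 4,
      'ekg': 5, 'hg': 4, 'sz': 4, 'sg': 4, 'ap': 4, 'bm': 1}
total = sum(df.values())
print(total, round(budget))   # 42 24
```

The 42 candidate degrees of freedom exceed the budget of about 24, which is why data reduction is pursued in the remainder of the chapter.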

## 8.3 Redundancy Analysis

As described in Section 4.7.1, it is occasionally useful to do a rigorous redundancy analysis on a set of potential predictors. Let us run the algorithm discussed there on the set of predictors we are considering. We will use a low threshold (0.3) for $$R^2$$ for demonstration purposes.

Code
# Allow only 1 d.f. for three of the predictors
prostate <-
  transform(prostate,
            ekg.norm = 1*(ekg %in% c("normal","benign")),
            rxn = as.numeric(rx),
            pfn = as.numeric(pf))
# Force pfn, rxn to be linear because of difficulty of placing
# knots with so many ties in the data
# Note: all incomplete cases are deleted (inefficient)
redun(~ stage + I(rxn) + age + wt + I(pfn) + hx +
        sbp + dbp + ekg.norm + hg + sz + sg + ap + bm,
      r2=.3, type='adjusted', data=prostate)

Redundancy Analysis

redun(formula = ~stage + I(rxn) + age + wt + I(pfn) + hx + sbp +
dbp + ekg.norm + hg + sz + sg + ap + bm, data = prostate,
r2 = 0.3, type = "adjusted")

n: 483  p: 14   nk: 3

Number of NAs:   19
Frequencies of Missing Values Due to Each Variable
   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp
       0        0        1        2        0        0        0        0
ekg.norm       hg       sz       sg       ap       bm
       0        0        5       11        0        0

Transformation of target variables forced to be linear

R-squared cutoff: 0.3   Type: adjusted

R^2 with which each variable can be predicted from all other variables:

   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp
   0.658    0.000    0.073    0.111    0.156    0.062    0.452    0.417
ekg.norm       hg       sz       sg       ap       bm
   0.055    0.146    0.192    0.540    0.147    0.391

Redundant variables:

stage sbp bm sg

Predicted from variables:

I(rxn) age wt I(pfn) hx dbp ekg.norm hg sz ap

  Variable Deleted   R^2  R^2 after later deletions
1            stage 0.658          0.658 0.646 0.494
2              sbp 0.452                0.453 0.455
3               bm 0.374                      0.367
4               sg 0.342

By any reasonable criterion on $$R^2$$, none of the predictors is redundant. stage can be predicted with $$R^{2} = 0.658$$ from the other 13 variables, but only with $$R^{2} = 0.494$$ after deletion of the 3 variables later declared to be “redundant.”
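The logic of redun can be sketched generically: regress each variable on all the others and flag any that can be predicted with $$R^2$$ above the cutoff. A self-contained toy illustration follows (Python, with x3 constructed as an exact linear combination of x1 and x2; this is not the Hmisc algorithm, which also allows nonlinear spline expansions and deletes variables one at a time):

```python
# Toy redundancy analysis: each variable is regressed on all the others;
# a variable predictable with R^2 above a cutoff is flagged as redundant.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def r2(y, X):
    """R^2 from ordinary least squares of y on the columns of X (plus intercept)."""
    Xi = [[1.0] + row for row in X]
    p = len(Xi[0])
    XtX = [[sum(r[i] * r[j] for r in Xi) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yy for r, yy in zip(Xi, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    yhat = [sum(b * v for b, v in zip(beta, row)) for row in Xi]
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

# x3 is an exact linear combination of x1 and x2
x1 = [float(i) for i in range(20)]
x2 = [float(i * i % 7) for i in range(20)]
x3 = [a + 2 * b for a, b in zip(x1, x2)]
data = {'x1': x1, 'x2': x2, 'x3': x3}
for name, y in data.items():
    X = [list(row) for row in zip(*[v for k, v in data.items() if k != name])]
    R2 = r2(y, X)
    print(f"{name}: R2 = {R2:.3f}" + (" <- redundant" if R2 > 0.95 else ""))
# all three variables are flagged, since each is exactly recoverable
# from the other two
```

Note that all three variables are flagged at once, because each is exactly recoverable from the other two. This is why redun removes variables sequentially and re-checks after each deletion, as reflected in the “R^2 after later deletions” column above.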

## 8.4 Variable Clustering

From Table 8.1, the total number of parameters is 42, so some data reduction should be considered. We resist the temptation to take the “easy way out” using stepwise variable selection so that we can achieve a more stable modeling process 1 and obtain unbiased standard errors. Before using a variable clustering procedure, note that ap is extremely skewed. To handle skewness, we use Spearman rank correlations for continuous variables (later we transform each variable using transcan, which will allow ordinary correlation coefficients to be used). After classifying ekg as “normal/benign” versus everything else, the Spearman correlations are plotted below.

• 1 Sauerbrei & Schumacher (1992) used the bootstrap to demonstrate the variability of a standard variable selection procedure for the prostate cancer dataset.
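The clustering step itself is generic: repeatedly agglomerate whichever clusters are most similar, using a similarity matrix in place of distances. A minimal sketch in Python (average linkage, with invented similarity values standing in for the Hoeffding $$D$$ statistics; this is not the varclus implementation):

```python
# Agglomerative clustering on a similarity matrix.  The numbers below
# are illustrative only, not actual Hoeffding D values from the data.
names = ['sbp', 'dbp', 'sz', 'sg', 'ap']
sim = {('sbp','dbp'): 0.50, ('sbp','sz'): 0.02, ('sbp','sg'): 0.03,
       ('sbp','ap'): 0.01, ('dbp','sz'): 0.02, ('dbp','sg'): 0.02,
       ('dbp','ap'): 0.01, ('sz','sg'): 0.30, ('sz','ap'): 0.25,
       ('sg','ap'): 0.35}

def s(a, b):
    # symmetric lookup in the similarity table
    return sim.get((a, b), sim.get((b, a), 0.0))

def avg_sim(c1, c2):
    # average-linkage similarity between two clusters
    return sum(s(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

clusters = [[n] for n in names]
while len(clusters) > 2:                      # stop at two clusters
    i, j = max(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] += clusters.pop(j)
print(sorted(sorted(c) for c in clusters))
# [['ap', 'sg', 'sz'], ['dbp', 'sbp']]
```

With these illustrative numbers the procedure recovers a blood pressure cluster and a tumor-related cluster, mirroring the groupings suggested by the dendrogram below.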

Code
x <- with(prostate,
          cbind(stage, rx, age, wt, pf, hx, sbp, dbp,
                ekg.norm, hg, sz, sg, ap, bm))
# If no missing data, could use cor(apply(x, 2, rank))
r <- rcorr(x, type="spearman")$r   # rcorr in Hmisc
maxabsr <- max(abs(r[row(r) != col(r)]))

Code
plotCorrM(r)[[1]]   # An Hmisc function

Figure 8.1: Matrix of Spearman $$\rho$$ rank correlation coefficients between predictors. Horizontal gray scale lines correspond to $$\rho=0$$. The tallest bar corresponds to $$|\rho|=0.785$$.

We perform a hierarchical cluster analysis based on a similarity matrix that contains pairwise Hoeffding $$D$$ statistics. $$D$$ will detect nonmonotonic associations.

Code
vc <- varclus(~ stage + rxn + age + wt + pfn + hx + sbp + dbp +
                ekg.norm + hg + sz + sg + ap + bm,
              sim='hoeffding', data=prostate)
plot(vc)

Figure 8.2: Hierarchical clustering using Hoeffding’s $$D$$ as a similarity measure. Dummy variables were used for the categorical variable ekg. Some of the dummy variables cluster together since they are by definition negatively correlated.

We combine sbp and dbp, and tentatively combine ap, sg, sz, and bm.

## 8.5 Transformation and Single Imputation Using transcan

Now we turn to the scoring of the predictors to potentially reduce the number of regression parameters that are needed later by doing away with the need for nonlinear terms and multiple dummy variables. The R Hmisc package transcan function defaults to using a maximum generalized variance method that incorporates canonical variates to optimally transform both sides of a multiple regression model. Each predictor is treated in turn as a variable being predicted, and all variables are expanded into restricted cubic splines (for continuous variables) or dummy variables (for categorical ones).

Code
spar(bot=1)
# Combine 2 levels of ekg (one had freq. 1)
levels(prostate$ekg)[levels(prostate$ekg) %in%
                     c('old MI', 'recent MI')] <- 'MI'
prostate$pf.coded <- as.integer(prostate$pf)
# make a numeric version; combine last 2 levels of original
levels(prostate$pf) <- levels(prostate$pf)[c(1,2,3,3)]
ptrans <-
  transcan(~ sz + sg + ap + sbp + dbp + age + wt + hg +
             ekg + pf + bm + hx,
           imputed=TRUE, transformed=TRUE, trantab=TRUE,
           pl=FALSE, show.na=TRUE, data=prostate,
           frac=.1, pr=FALSE)
summary(ptrans, digits=4)

transcan(x = ~sz + sg + ap + sbp + dbp + age + wt + hg + ekg +
    pf + bm + hx, imputed = TRUE, trantab = TRUE, transformed = TRUE,
    pr = FALSE, pl = FALSE, show.na = TRUE, data = prostate, frac = 0.1)

Iterations: 8

R-squared achieved in predicting each variable:

   sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx
0.207 0.556 0.573 0.498 0.485 0.095 0.122 0.158 0.092 0.113 0.349 0.108

Adjusted R-squared:

   sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx
0.180 0.541 0.559 0.481 0.468 0.065 0.093 0.129 0.059 0.086 0.331 0.083

Coefficients of canonical variates for predicting each (row) variable

       sz    sg    ap    sbp   dbp   age   wt    hg    ekg   pf    bm    hx
sz           0.66  0.20  0.33  0.33 -0.01 -0.01  0.11  0.11  0.03 -0.36  0.34
sg     0.23        0.84  0.08  0.07 -0.02  0.01 -0.01 -0.07  0.02 -0.20  0.14
ap     0.07  0.80       -0.11 -0.05  0.03 -0.02  0.01  0.01  0.00 -0.83 -0.03
sbp    0.13  0.10 -0.14       -0.94  0.14 -0.09  0.03  0.10  0.10 -0.03 -0.14
dbp    0.13  0.09 -0.06 -0.98        0.14  0.07  0.05  0.03  0.04  0.03 -0.01
age   -0.02 -0.06  0.18  0.58  0.57        0.14  0.46  0.43 -0.03  1.05 -0.76
wt    -0.02  0.06 -0.08 -0.31  0.23  0.12        0.51 -0.06  0.21 -1.09  0.27
hg     0.13 -0.02  0.03  0.09  0.15  0.33  0.43       -0.02  0.24 -1.53 -0.12
ekg    0.20 -0.38  0.10  0.42  0.12  0.41 -0.04 -0.04        0.15 -0.42 -1.23
pf     0.04  0.08  0.02  0.36  0.14 -0.03  0.22  0.29  0.13       -1.75 -0.46
bm    -0.02 -0.03 -0.13  0.00  0.00  0.03 -0.04 -0.06 -0.01 -0.06       -0.02
hx     0.04  0.05 -0.01 -0.04  0.00 -0.06  0.02 -0.01 -0.09 -0.04 -0.05

Summary of imputed values

sz
        n  missing  distinct  Info   Mean    Gmd
        5        0         4  0.95  12.86  10.31

Value      6.000  7.416  20.180  24.690
Frequency      2      1       1       1
Proportion   0.4    0.2     0.2     0.2

sg
        n  missing  distinct   Info  Mean    Gmd
       11        0        10  0.995  10.1  3.205
      .05    .10    .25     .50     .75     .90     .95
    6.900  7.289  7.697  10.270  10.560  15.000  15.000

lowest : 6.511  7.289  7.394  8.000  10.250
highest: 10.270  10.320  10.390  10.730  15.000

Value      6.511  7.289  7.394  8.000 10.250 10.270 10.320 10.390 10.730
Frequency      1      1      1      1      1      1      1      1      1
Proportion 0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091

Value      15.000
Frequency       2
Proportion  0.182

age
        n  missing  distinct  Info   Mean  Gmd
        1        0         1     0  71.65   NA

Value      71.65
Frequency      1
Proportion     1

wt
        n  missing  distinct  Info   Mean    Gmd
        2        0         2     1  97.77  13.06

Value      91.24  104.30
Frequency      1       1
Proportion   0.5     0.5

ekg
        n  missing  distinct   Info   Mean   Gmd
        8        0         4  0.905  2.625  1.75

Value          1      3      4      5
Frequency      3      3      1      1
Proportion 0.375  0.375  0.125  0.125

Starting estimates for imputed values:

  sz   sg   ap  sbp  dbp  age   wt   hg  ekg   pf   bm   hx
11.0 10.0  0.7 14.0  8.0 73.0 98.0 13.7  1.0  1.0  0.0  0.0

Code
ggplot(ptrans, scale=TRUE) +
  theme(axis.text.x=element_text(size=6))

Figure 8.3: Simultaneous transformation and single imputation of all candidate predictors using transcan. Imputed values are shown as red plus signs. Transformed values are arbitrarily scaled to $$[0,1]$$.

Note that at face value the transformation of ap was derived in a circular manner, since the combined index of stage and histologic grade, sg, uses in its stage component a cutoff on ap. However, if sg is omitted from consideration, the resulting transformation for ap does not change appreciably. Note that bm and hx are represented as binary variables, so their coefficients in the table of canonical variable coefficients are on a different scale. For the variables that were actually transformed, the coefficients are for standardized transformed variables (mean 0, variance 1). From examining the $$R^2$$s, age, wt, ekg, pf, and hx are not strongly related to other variables. Imputations for age, wt, ekg are thus relying more on the median or modal values from the marginal distributions.
From the coefficients of first (standardized) canonical variates, sbp is predicted almost solely from dbp; bm is predicted mainly from ap, hg, and pf. 2

• 2 Schemper & Heinze (1997) used logistic models to impute dichotomizations of the predictors for this dataset.

## 8.6 Data Reduction Using Principal Components

The first PC, PC$$_{1}$$, is the linear combination of standardized variables having maximum variance. PC$$_{2}$$ is the linear combination of predictors having the second largest variance such that PC$$_{2}$$ is orthogonal to (uncorrelated with) PC$$_{1}$$. If there are $$p$$ raw variables, the first $$k$$ PCs, where $$k < p$$, will explain only part of the variation in the whole system of $$p$$ variables unless one or more of the original variables is exactly a linear combination of the remaining variables. Note that it is common to scale and center variables to have mean zero and variance 1 before computing PCs.

The response variable (here, time until death due to any cause) is not examined during data reduction, so that if PCs are selected by variance explained in the $$X$$-space and not by variation explained in $$Y$$, one needn’t correct for model uncertainty or multiple comparisons. PCA results in data reduction when the analyst uses only a subset of the $$p$$ possible PCs in predicting $$Y$$. This is called incomplete principal component regression. When one sequentially enters PCs into a predictive model in a strict pre-specified order (i.e., by descending amounts of variance explained for the system of $$p$$ variables), model uncertainty requiring bootstrap adjustment is minimized. In contrast, model uncertainty associated with stepwise regression (driven by associations with $$Y$$) is massive.

For the prostate dataset, consider PCs on raw candidate predictors, expanding polytomous factors using dummy variables. The R function princomp is used, after singly imputing missing raw values using transcan’s optimal additive nonlinear models. In this series of analyses we ignore the treatment variable, rx.

Code
spar(top=1)
# Impute all missing values in all variables given to transcan
imputed <- impute(ptrans, data=prostate, list.out=TRUE)

Imputed missing values with the following frequencies
and stored them in variables with their original names:

 sz  sg age  wt ekg
  5  11   1   2   8

Code
imputed <- as.data.frame(imputed)
# Compute principal components on imputed data.
# Create a design matrix from ekg categories
Ekg <- model.matrix(~ ekg, data=imputed)[, -1]
# Use correlation matrix
pfn <- prostate$pfn
prin.raw <- princomp(~ sz + sg + ap + sbp + dbp + age +
wt + hg + Ekg + pfn + bm + hx,
cor=TRUE, data=imputed)

plot(prin.raw, type='lines', main='', ylim=c(0,3))
# Add cumulative fraction of variance explained
addscree <- function(x, npcs=min(10, length(x$sdev)),
                     plotv=FALSE, col=1, offset=.8,
                     adj=0, pr=FALSE) {
  vars <- x$sdev^2
  cumv <- cumsum(vars)/sum(vars)
  if(pr) print(cumv)
  text(1:npcs, vars[1:npcs] + offset*par('cxy')[2],
       as.character(round(cumv[1:npcs], 2)),
       adj=adj, col=col)
  if(plotv) lines(1:npcs, vars[1:npcs], type='b', col=col)
}
addscree(prin.raw)
prin.trans <- princomp(ptrans$transformed, cor=TRUE)
addscree(prin.trans, npcs=10, plotv=TRUE, col='red',
         offset=-.8, adj=1)

Figure 8.4: Variance of the system of raw predictors (black) explained by individual principal components (lines) along with cumulative proportion of variance explained (text), and variance explained by components computed on transcan-transformed variables (red)

The resulting plot shown in Figure 8.4 is called a “scree” plot [Jolliffe (2010); pp. 96–99, 104, 106]. It shows the variation explained by the first $$k$$ principal components as $$k$$ increases all the way to 16 parameters (no data reduction). It requires 10 of the 16 possible components to explain $$> 0.8$$ of the variance, and the first 5 components explain $$0.49$$ of the variance of the system. Two of the 16 dimensions are almost totally redundant.

After repeating this process when transforming all predictors via transcan, we have only 12 degrees of freedom for the 12 predictors. The variance explained is depicted in Figure 8.4 in red. It requires at least 9 of the 12 possible components to explain $$\geq 0.9$$ of the variance, and the first 5 components explain $$0.66$$ of the variance as opposed to $$0.49$$ for untransformed variables.

Let us see how the PCs “explain” the times until death using the Cox regression function from rms, cph, described in Chapter 20. In what follows we vary the number of components used in the Cox models from 1 to all 16, computing the AIC for each model. AIC is $$-2$$ times the model log likelihood plus twice the number of parameters estimated; lower is better. For reference, the AIC of the model using all of the original predictors, and the AIC of a full additive spline model, are shown as horizontal lines.

Code
require(rms)
spar(bty='l')
# two-column response variable
S <- with(prostate, Surv(dtime, status != "alive"))
pcs <- prin.raw$scores         # pick off all PCs
aic <- numeric(16)
for(i in 1:16) {
ps <- pcs[,1:i]
aic[i] <- AIC(cph(S ~ ps))
}
plot(1:16, aic, xlab='Number of Components Used',
ylab='AIC', type='l', ylim=c(3950,4000))
f <- cph(S ~ sz + sg + log(ap) + sbp + dbp + age + wt + hg +
ekg + pf + bm + hx, data=imputed)
abline(h=AIC(f), col='blue')
## The following model in the 2nd edition no longer converges
# f <- cph(S ~ rcs(sz,5) + rcs(sg,5) + rcs(log(ap),5) +
#          rcs(sbp,5) + rcs(dbp,5) + rcs(age,3) + rcs(wt,5) +
#          rcs(hg,5) + ekg + pf + bm + hx,
#          tol=1e-14, data=imputed)
f <- cph(S ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),5) +
rcs(sbp,4) + rcs(dbp,4) + rcs(age,3) + rcs(wt,4) +
rcs(hg,4) + ekg + pf + bm + hx,
tol=1e-14, data=imputed)
abline(h=AIC(f), col='blue', lty=2)

Figure 8.5: AIC of Cox models fitted with progressively more principal components. The solid blue line depicts the AIC of the model with all original covariates. The dotted blue line is positioned at the AIC of the full spline model.

For the money, the first 5 components adequately summarize all variables, if linearly transformed, and the full linear model is no better than this. The model allowing all continuous predictors to be nonlinear is not worth its added degrees of freedom.
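These comparisons rest on AIC $$= 2p - 2\log L$$: an added parameter must raise the log likelihood by more than 1 to lower AIC. A tiny check with hypothetical log likelihoods (not values from the models above):

```python
# AIC = 2*p - 2*logLik; an extra parameter must buy > 1 unit of
# log likelihood to reduce AIC.  The log likelihoods below are
# hypothetical, chosen only to illustrate the trade-off.
def aic(loglik, p):
    return 2 * p - 2 * loglik

base   = aic(-1990.0, 5)   # say, a model using 5 components
richer = aic(-1989.2, 6)   # a 6th component gains only 0.8 in logLik
print(base, richer)        # 3990.0 3990.4: the 6th component hurts
```

This is why the AIC curves in Figure 8.5 can turn upward even while the log likelihood keeps improving: the improvements stop covering the 2-per-parameter penalty.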

Next check the performance of a model derived from cluster scores of transformed variables.

Code
# Compute PC1 on a subset of transcan-transformed predictors
pco <- function(v) {
  f <- princomp(ptrans$transformed[,v], cor=TRUE)
  vars <- f$sdev^2
  cat('Fraction of variance explained by PC1:',
      round(vars[1]/sum(vars), 2), '\n')
  f$scores[,1]
}
tumor <- pco(c('sz','sg','ap','bm'))

Fraction of variance explained by PC1: 0.59

Code
bp <- pco(c('sbp','dbp'))

Fraction of variance explained by PC1: 0.84

Code
cardiac <- pco(c('hx','ekg'))

Fraction of variance explained by PC1: 0.61

Code
# Get transformed individual variables that are not clustered
other <- ptrans$transformed[,c('hg','age','pf','wt')]
f <- cph(S ~ tumor + bp + cardiac + other)  # other is a matrix
AIC(f)

3954.393
Code
print(f, latex=TRUE, long=FALSE, title='')

                    Model Tests       Discrimination Indexes
Obs        502      LR χ2     81.11   R2              0.149
Events     354      d.f.          7   R2(7,502)       0.137
Center       0      Pr(>χ2)  0.0000   R2(7,354)       0.189
                    Score χ2  86.81   Dxy             0.286
                    Pr(>χ2)  0.0000

              β      S.E.   Wald Z  Pr(>|Z|)
tumor     -0.1723  0.0367   -4.69   <0.0001
bp         0.0251  0.0424    0.59    0.5528
cardiac    0.2513  0.0516    4.87   <0.0001
hg        -0.1407  0.0554   -2.54    0.0111
age       -0.1034  0.0579   -1.79    0.0739
pf        -0.0933  0.0487   -1.92    0.0551
wt        -0.0910  0.0555   -1.64    0.1012

The tumor and cardiac clusters seem to dominate prediction of mortality, and the AIC of the model built from cluster scores of transformed variables compares favorably with other models (Figure 8.5).
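For a two-variable cluster such as bp, the fraction of variance explained by PC$$_{1}$$ has a closed form: with standardized variables having correlation $$r$$, the correlation matrix has eigenvalues $$1+r$$ and $$1-r$$, so PC$$_{1}$$ carries $$(1+|r|)/2$$ of the variance. A one-line check (Python; the implied $$r=0.68$$ is back-calculated from the 0.84 reported above, not a quantity computed in the chapter):

```python
# PC1 variance fraction for two standardized variables with correlation r:
# the eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r.
def pc1_fraction(r):
    return (1 + abs(r)) / 2

print(pc1_fraction(0.68))   # 0.84
```

This also shows why PC$$_{1}$$ of an uncorrelated pair ($$r=0$$) explains only half the variance: scoring a cluster is worthwhile only when its members are substantially intercorrelated.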

### 8.6.1 Sparse Principal Components

A disadvantage of principal components is that every predictor receives a nonzero weight for every component, so many coefficients are involved even though the effective degrees of freedom with respect to the response model are reduced. Sparse principal components uses a penalty function to reduce the magnitude of the loadings variables receive in the components. If an L1 penalty is used (as with the lasso), some loadings are shrunk to zero, resulting in some simplicity. Sparse principal components combines some elements of variable clustering, scoring of variables within clusters, and redundancy analysis.
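How an L1 penalty produces exact zeros can be seen with the soft-thresholding operator (a schematic of the general lasso mechanism, not of the sPCAgrid algorithm used below; the loadings are invented):

```python
# Soft-thresholding, the building block of L1-penalized estimates:
# entries within lam of zero are set exactly to zero, then the
# loading vector is renormalized to unit length.
import math

def soft_threshold(v, lam):
    w = [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return [x / norm for x in w]

dense = [0.62, 0.58, 0.45, 0.05, -0.08, 0.26]   # hypothetical loadings
sparse = soft_threshold(dense, 0.30)
print([round(x, 2) for x in sparse])   # small loadings dropped to exact 0
```

Loadings within the penalty of zero are eliminated outright rather than merely shrunk, which is what yields the mostly-blank loading matrix shown below.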

Filzmoser et al. (2012) have written a nice R package pcaPP for doing sparse PC analysis.3 The following example uses the prostate data again. To allow for nonlinear transformations and to score the ekg variable in the prostate dataset down to a scalar, we use the transcan-transformed predictors as inputs.

• 3 The spca package is a new sparse PC package that should also be considered.

Code
require(pcaPP)
spar()
s <- sPCAgrid(ptrans$transformed, k=10, method='sd',
              center=mean, scale=sd, scores=TRUE, maxiter=10)
plot(s, type='lines', main='', ylim=c(0,3))
addscree(s)
lo <- unclass(s$loadings)
flo <- format(round(lo, 2), zero.print=FALSE)
print(flo, quote=FALSE)    # These loadings are on the orig. transcan scale
    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
sz   0.25                                                    0.95   0.04
sg   0.62                                                   -0.07   0.52
ap   0.63                                                   -0.30
sbp        -0.71
dbp         0.71
age                       1.00
wt                                             1.00
hg                                      1.00
ekg                1.00
pf                                                    1.00
bm  -0.39                                                           0.85
hx                               1.00

Figure 8.6: Variance explained by individual sparse principal components (lines) along with cumulative proportion of variance explained (text)

Only nonzero loadings are shown. The first sparse PC is the tumor cluster used above, and the second is the blood pressure cluster. Let us see how well incomplete sparse principal component regression predicts time until death.

Code
spar(bty='l')
pcs <- s\$scores         # pick off sparse PCs
aic <- numeric(10)
for(i in 1:10) {
ps <- pcs[,1:i]
aic[i] <- AIC(cph(S ~ ps))
}
plot(1:10, aic, xlab='Number of Components Used',
ylab='AIC', type='l',  ylim=c(3950,4000))

Figure 8.7: Performance of sparse principal components in Cox models

More components are required to optimize AIC than were seen in Figure 8.5, but a model built from 6–8 sparse PCs performed as well as the other models.

## 8.7 Transformation Using Nonparametric Smoothers

The ACE nonparametric additive regression method of Breiman & Friedman (1985) transforms both the left-hand-side variable and all the right-hand-side variables so as to optimize $$R^{2}$$. ACE can be used to transform the predictors using the R ace function in the acepack package, called by the transace function in the Hmisc package. transace does not impute data but merely does casewise deletion of missing values. Here transace is run after single imputation by transcan. binary is used to tell transace which variables not to try to predict (because they need no transformation). Several predictors are restricted to be monotonically transformed.
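The criterion ACE optimizes, $$R^2$$ after transformation, can be seen in a toy example where the best transformation is known analytically (Python; ACE itself discovers the transformation by alternating nonparametric smoothing, which is not reproduced here):

```python
# ACE searches for transformations maximizing R^2.  Toy case where the
# optimal transformation is known: y = exp(x).  On the raw scale the
# linear correlation is weakened; after a log transformation of y the
# relationship is exactly linear.
import math

def corr(a, b):
    # Pearson correlation coefficient
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

x = [i / 10 for i in range(1, 60)]
y = [math.exp(v) for v in x]
r_raw = corr(x, y)
r_log = corr(x, [math.log(v) for v in y])
print(round(r_raw, 3), round(r_log, 3))   # r_log is exactly 1
```

Pearson correlation understates the deterministic relationship on the raw scale; after the log transformation $$R^2$$ is exactly 1. This is the kind of gain ACE (and transcan) searches for automatically.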

Code
spar(mfrow=c(4,3), left=-2)
x <- with(imputed,
cbind(sz, sg, ap, sbp, dbp, age, wt, hg, ekg, pf,
bm, hx))
monotonic <- c("sz","sg","ap","sbp","dbp","age","pf")
transace(x, monotonic=monotonic,
         categorical="ekg", binary=c("bm","hx"))
R-squared achieved in predicting each variable:

sz        sg        ap       sbp       dbp       age        wt        hg
0.2279024 0.5748361 0.5728639 0.4823925 0.4579628 0.1514941 0.1718541 0.2020425
ekg        pf        bm        hx
0.1113341 0.1774896        NA        NA

Figure 8.8: Simultaneous transformation of all variables using ACE.

Except for ekg and age, and allowing for arbitrary sign reversals, the transformations in Figure 8.8 determined using transace were similar to those in Figure 8.3. The transcan transformation for ekg makes more sense.