8  Case Study in Data Reduction

Recall that the aim of data reduction is to reduce (without using the outcome) the number of parameters needed in the outcome model. The following case study illustrates these techniques:

  1. redundancy analysis;
  2. variable clustering;
  3. data reduction using principal component analysis (PCA), sparse PCA, and pre-transformations;
  4. restricted cubic spline fitting using ordinary least squares, in the context of scaling; and
  5. scaling/variable transformations using canonical variates and nonparametric additive regression.

8.1 Data

Consider the 506-patient prostate cancer dataset from Byar & Green (1980). The data are listed in [Andrews & Herzberg (1985); Table 46] and are available at https://hbiostat.org/data. These data were from a randomized trial comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration. The ultimate goal of an analysis of the dataset might be to discover patterns in survival or to do an analysis of covariance to assess the effect of treatment while adjusting for patient heterogeneity. See Chapter 21 for such analyses. The data reductions developed here are general and can be used for a variety of dependent variables.

The variable names, labels, and a summary of the data are printed below.

Code
require(Hmisc)
getHdata(prostate)  # Download and make prostate accessible
# Convert an old date format to R format
prostate$sdate <- as.Date(prostate$sdate)
d <- describe(prostate[2:17])
html(d)
prostate[2:17]

16 Variables   502 Observations
--------------------------------------------------------------------------
stage : Stage
       n  missing distinct     Info     Mean      Gmd
     502        0        2    0.733    3.424   0.4895

 Value          3     4
 Frequency    289   213
 Proportion 0.576 0.424
--------------------------------------------------------------------------
rx
       n  missing distinct
     502        0        4

 Value              placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen
 Frequency              127             124             126             125
 Proportion           0.253           0.247           0.251           0.249
--------------------------------------------------------------------------
dtime : Months of Follow-up
       n  missing distinct     Info     Mean      Gmd
     502        0       76        1    36.13    26.89
     .05      .10      .25      .50      .75      .90      .95
    1.05     5.00    14.25    34.00    57.75    67.00    71.00

lowest :  0  1  2  3  4,  highest: 72 73 74 75 76
--------------------------------------------------------------------------
status
       n  missing distinct
     502        0       10

lowest : alive, dead - prostatic ca, dead - heart or vascular,
         dead - cerebrovascular, dead - pulmonary embolus
highest: dead - other ca, dead - respiratory disease,
         dead - other specific non-ca, dead - unspecified non-ca,
         dead - unknown cause
--------------------------------------------------------------------------
age : Age in Years
       n  missing distinct     Info     Mean      Gmd
     501        1       41    0.996    71.46    7.497
     .05      .10      .25      .50      .75      .90      .95
      56       60       70       73       76       78       80

lowest : 48 49 50 51 52,  highest: 84 85 87 88 89
--------------------------------------------------------------------------
wt : Weight Index = wt(kg)-ht(cm)+200
       n  missing distinct     Info     Mean      Gmd
     500        2       67    0.999    99.03    14.93
     .05      .10      .25      .50      .75      .90      .95
   77.95    82.90    90.00    98.00   107.00   116.00   123.00

lowest : 69 71 72 73 74,  highest: 136 142 145 150 152
--------------------------------------------------------------------------
pf
       n  missing distinct
     502        0        4

 Value           normal activity in bed < 50% daytime in bed > 50% daytime
 Frequency                   450                   37                   13
 Proportion                0.896                0.074                0.026

 Value           confined to bed
 Frequency                     2
 Proportion                0.004
--------------------------------------------------------------------------
hx : History of Cardiovascular Disease
       n  missing distinct     Info      Sum     Mean      Gmd
     502        0        2    0.733      213   0.4243   0.4895
--------------------------------------------------------------------------
sbp : Systolic Blood Pressure/10
       n  missing distinct     Info     Mean      Gmd
     502        0       18     0.98    14.35    2.596
     .05      .10      .25      .50      .75      .90      .95
      11       12       13       14       16       17       18

lowest :  8  9 10 11 12,  highest: 21 22 23 24 30

 Value          8     9    10    11    12    13    14    15    16    17    18    19
 Frequency      1     3    14    27    65    74    98    74    72    34    17    12
 Proportion 0.002 0.006 0.028 0.054 0.129 0.147 0.195 0.147 0.143 0.068 0.034 0.024

 Value         20    21    22    23    24    30
 Frequency      3     2     3     1     1     1
 Proportion 0.006 0.004 0.006 0.002 0.002 0.002
--------------------------------------------------------------------------
dbp : Diastolic Blood Pressure/10
       n  missing distinct     Info     Mean      Gmd
     502        0       12    0.945    8.149    1.553
     .05      .10      .25      .50      .75      .90      .95
       6        6        7        8        9       10       10

lowest :  4  5  6  7  8,  highest: 11 12 13 14 18

 Value          4     5     6     7     8     9    10    11    12    13    14    18
 Frequency      4     5    43   107   165    94    66     9     5     2     1     1
 Proportion 0.008 0.010 0.086 0.213 0.329 0.187 0.131 0.018 0.010 0.004 0.002 0.002
--------------------------------------------------------------------------
ekg
       n  missing distinct
     494        8        7

lowest : normal, benign, rhythmic disturb & electrolyte ch,
         heart block or conduction def, heart strain
highest: rhythmic disturb & electrolyte ch, heart block or conduction def,
         heart strain, old MI, recent MI
--------------------------------------------------------------------------
hg : Serum Hemoglobin (g/100ml)
       n  missing distinct     Info     Mean      Gmd
     502        0       91        1    13.45     2.16
     .05      .10      .25      .50      .75      .90      .95
    10.2     10.7     12.3     13.7     14.7     15.8     16.4

lowest :  5.899414  7.000000  7.199219  7.799805  8.199219
highest: 17.296875 17.500000 17.597656 18.199219 21.199219
--------------------------------------------------------------------------
sz : Size of Primary Tumor (cm^2)
       n  missing distinct     Info     Mean      Gmd
     497        5       55    0.998    14.63    13.05
     .05      .10      .25      .50      .75      .90      .95
     2.0      3.0      5.0     11.0     21.0     32.0     39.2

lowest :  0  1  2  3  4,  highest: 54 55 61 62 69
--------------------------------------------------------------------------
sg : Combined Index of Stage and Hist. Grade
       n  missing distinct     Info     Mean      Gmd
     491       11       11    0.959    10.31    2.245
     .05      .10      .25      .50      .75      .90      .95
       8        8        9       10       11       13       13

lowest :  5  6  7  8  9,  highest: 11 12 13 14 15

 Value          5     6     7     8     9    10    11    12    13    14    15
 Frequency      3     8     7    67   137    33   114    26    75     5    16
 Proportion 0.006 0.016 0.014 0.136 0.279 0.067 0.232 0.053 0.153 0.010 0.033
--------------------------------------------------------------------------
ap : Serum Prostatic Acid Phosphatase
       n  missing distinct     Info     Mean      Gmd
     502        0      128    0.996    12.18    21.71
     .05      .10      .25      .50      .75      .90      .95
   0.300    0.300    0.500    0.700    2.975   21.689   38.470

lowest :   0.09999084   0.19998169   0.29998779   0.39996338   0.50000000
highest: 316.00000000 353.50000000 367.00000000 596.00000000 999.87500000

 Value          0    10    20    30    40    50    60    80   100   130   150   160
 Frequency    406    38    13    14    13     3     2     1     1     1     1     1
 Proportion 0.809 0.076 0.026 0.028 0.026 0.006 0.004 0.002 0.002 0.002 0.002 0.002

 Value        180   230   280   320   350   370   600  1000
 Frequency      1     1     1     1     1     1     1     1
 Proportion 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002

For the frequency table, variable is rounded to the nearest 10
--------------------------------------------------------------------------
bm : Bone Metastases
       n  missing distinct     Info      Sum     Mean      Gmd
     502        0        2    0.418       82   0.1633   0.2739
--------------------------------------------------------------------------

stage is defined by ap as well as X-ray results. Of the patients in stage 3, 0.92 have ap \(\leq\) 0.8. Of those in stage 4, 0.93 have ap > 0.8. Since stage can be predicted almost certainly from ap, we do not consider stage in some of the analyses.

8.2 How Many Parameters Can Be Estimated?

There are 354 deaths among the 502 patients. If predicting survival time were of major interest, we could develop a reliable model if no more than about \(354/15 = 24\) parameters were examined against \(Y\) in unpenalized modeling. Suppose that a full model with no interactions is fitted and that linearity is not assumed for any continuous predictors. Assuming age is almost linear, we could fit a restricted cubic spline function with three knots. For the other continuous variables, let us use five knots. For categorical predictors, the maximum number of degrees of freedom needed would be one fewer than the number of categories. For pf we could lump the last two categories since the last category has only 2 patients. Likewise, we could combine the last two levels of ekg. Table 8.1 lists candidate predictors with the maximum number of parameters we consider for each.

Table 8.1: Degrees of freedom needed for predictors

Predictor            rx age wt pf hx sbp dbp ekg hg sz sg ap bm
Number of Parameters  3   2  4  2  1   4   4   5  4  4  4  4  1
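As a quick arithmetic check (a sketch, not part of the original analysis), the entries in Table 8.1 sum to the 42 candidate parameters referred to later:

```r
# Maximum d.f. per predictor, taken from Table 8.1
df <- c(rx=3, age=2, wt=4, pf=2, hx=1, sbp=4, dbp=4,
        ekg=5, hg=4, sz=4, sg=4, ap=4, bm=1)
sum(df)   # 42 candidate parameters in all
```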

8.3 Redundancy Analysis

As described in Section 4.7.1, it is occasionally useful to do a rigorous redundancy analysis on a set of potential predictors. Let us run the algorithm discussed there on the set of predictors we are considering. We will use a low threshold (0.3) for \(R^2\) for demonstration purposes.

Code
# Allow only 1 d.f. for three of the predictors
prostate <-
  transform(prostate,
            ekg.norm = 1*(ekg %in% c("normal","benign")),
            rxn = as.numeric(rx),
            pfn = as.numeric(pf))
# Force pfn, rxn to be linear because of difficulty of placing
# knots with so many ties in the data
# Note: all incomplete cases are deleted (inefficient)
redun(~ stage + I(rxn) + age + wt + I(pfn) + hx +
      sbp + dbp + ekg.norm + hg + sz + sg + ap + bm,
      r2=.3, type='adjusted', data=prostate)

Redundancy Analysis

redun(formula = ~stage + I(rxn) + age + wt + I(pfn) + hx + sbp + 
    dbp + ekg.norm + hg + sz + sg + ap + bm, data = prostate, 
    r2 = 0.3, type = "adjusted")

n: 483  p: 14   nk: 3 

Number of NAs:   19 
Frequencies of Missing Values Due to Each Variable
   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp 
       0        0        1        2        0        0        0        0 
ekg.norm       hg       sz       sg       ap       bm 
       0        0        5       11        0        0 


Transformation of target variables forced to be linear

R-squared cutoff: 0.3   Type: adjusted 

R^2 with which each variable can be predicted from all other variables:

   stage   I(rxn)      age       wt   I(pfn)       hx      sbp      dbp 
   0.658    0.000    0.073    0.111    0.156    0.062    0.452    0.417 
ekg.norm       hg       sz       sg       ap       bm 
   0.055    0.146    0.192    0.540    0.147    0.391 

Rendundant variables:

stage sbp bm sg

Predicted from variables:

I(rxn) age wt I(pfn) hx dbp ekg.norm hg sz ap 

  Variable Deleted   R^2 R^2 after later deletions
1            stage 0.658         0.658 0.646 0.494
2              sbp 0.452               0.453 0.455
3               bm 0.374                     0.367
4               sg 0.342                          

By any reasonable criterion on \(R^2\), none of the predictors is redundant. stage can be predicted with an \(R^{2} = 0.658\) from the other 13 variables, but only with \(R^{2} = 0.494\) after deletion of 3 variables later declared to be “redundant.”

8.4 Variable Clustering

From Table 8.1, the total number of parameters is 42, so some data reduction should be considered. We resist the temptation to take the “easy way out” using stepwise variable selection so that we can achieve a more stable modeling process 1 and obtain unbiased standard errors. Before using a variable clustering procedure, note that ap is extremely skewed. To handle skewness, we use Spearman rank correlations for continuous variables (later we transform each variable using transcan, which will allow ordinary correlation coefficients to be used). After classifying ekg as “normal/benign” versus everything else, the Spearman correlations are plotted below.

  • 1 Sauerbrei & Schumacher (1992) used the bootstrap to demonstrate the variability of a standard variable selection procedure for the prostate cancer dataset.
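Why Spearman correlations handle the skewness of ap: with continuous, tie-free data, Spearman’s \(\rho\) is the Pearson correlation of the ranks, so it is unchanged by any monotonic transformation of a variable. A small self-contained sketch with simulated data (not the prostate variables):

```r
set.seed(1)
x <- rexp(100)                  # highly skewed, like ap
y <- x + rnorm(100, sd=0.5)
r1 <- cor(x, y, method='spearman')
r2 <- cor(rank(x), rank(y))     # Pearson correlation of the ranks
r3 <- cor(log(x), y, method='spearman')  # monotonic transform of x
c(r1, r2, r3)                   # all three are identical
```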

  • Code
    x <- with(prostate,
              cbind(stage, rx, age, wt, pf, hx, sbp, dbp,
                    ekg.norm, hg, sz, sg, ap, bm))
    # If no missing data, could use cor(apply(x, 2, rank))
    r <- rcorr(x, type="spearman")$r    # rcorr in Hmisc
    maxabsr <- max(abs(r[row(r) != col(r)]))
    Code
    plotCorrM(r)[[1]]    # An Hmisc function

    Figure 8.1: Matrix of Spearman \(\rho\) rank correlation coefficients between predictors. Horizontal gray scale lines correspond to \(\rho=0\). The tallest bar corresponds to \(|\rho|=0.785\).

    We perform a hierarchical cluster analysis based on a similarity matrix that contains pairwise Hoeffding \(D\) statistics (Hoeffding, 1948); \(D\) will detect nonmonotonic associations.

    Code
    vc <- varclus(~ stage + rxn + age + wt + pfn + hx +
                  sbp + dbp + ekg.norm + hg + sz + sg + ap + bm,
                  sim='hoeffding', data=prostate)
    plot(vc)

    Figure 8.2: Hierarchical clustering using Hoeffding’s \(D\) as a similarity measure. Dummy variables were used for the categorical variable ekg. Some of the dummy variables cluster together since they are by definition negatively correlated.

    We combine sbp and dbp, and tentatively combine ap, sg, sz, and bm.

    8.5 Transformation and Single Imputation Using transcan

    Now we turn to the scoring of the predictors to potentially reduce the number of regression parameters that are needed later by doing away with the need for nonlinear terms and multiple dummy variables. The R Hmisc package transcan function defaults to using a maximum generalized variance method (Kuhfeld, 2009) that incorporates canonical variates to optimally transform both sides of a multiple regression model. Each predictor is treated in turn as a variable being predicted, and all variables are expanded into restricted cubic splines (for continuous variables) or dummy variables (for categorical ones).

    Code
    spar(bot=1)
    # Combine 2 levels of ekg (one had freq. 1)
    levels(prostate$ekg)[levels(prostate$ekg) %in%
                         c('old MI', 'recent MI')] <- 'MI'
    
    prostate$pf.coded <- as.integer(prostate$pf)
    # make a numeric version; combine last 2 levels of original
    levels(prostate$pf) <- levels(prostate$pf)[c(1,2,3,3)]
    
    ptrans <-
      transcan(~ sz + sg + ap + sbp + dbp +
               age + wt + hg + ekg + pf + bm + hx, imputed=TRUE,
               transformed=TRUE, trantab=TRUE, pl=FALSE,
               show.na=TRUE, data=prostate, frac=.1, pr=FALSE)
    summary(ptrans, digits=4)
    transcan(x = ~sz + sg + ap + sbp + dbp + age + wt + hg + ekg + 
        pf + bm + hx, imputed = TRUE, trantab = TRUE, transformed = TRUE, 
        pr = FALSE, pl = FALSE, show.na = TRUE, data = prostate, 
        frac = 0.1)
    
    Iterations: 8 
    
    R-squared achieved in predicting each variable:
    
       sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx 
    0.207 0.556 0.573 0.498 0.485 0.095 0.122 0.158 0.092 0.113 0.349 0.108 
    
    Adjusted R-squared:
    
       sz    sg    ap   sbp   dbp   age    wt    hg   ekg    pf    bm    hx 
    0.180 0.541 0.559 0.481 0.468 0.065 0.093 0.129 0.059 0.086 0.331 0.083 
    
    Coefficients of canonical variates for predicting each (row) variable
    
        sz    sg    ap    sbp   dbp   age   wt    hg    ekg   pf    bm    hx   
    sz         0.66  0.20  0.33  0.33 -0.01 -0.01  0.11  0.11  0.03 -0.36  0.34
    sg   0.23        0.84  0.08  0.07 -0.02  0.01 -0.01 -0.07  0.02 -0.20  0.14
    ap   0.07  0.80       -0.11 -0.05  0.03 -0.02  0.01  0.01  0.00 -0.83 -0.03
    sbp  0.13  0.10 -0.14       -0.94  0.14 -0.09  0.03  0.10  0.10 -0.03 -0.14
    dbp  0.13  0.09 -0.06 -0.98        0.14  0.07  0.05  0.03  0.04  0.03 -0.01
    age -0.02 -0.06  0.18  0.58  0.57        0.14  0.46  0.43 -0.03  1.05 -0.76
    wt  -0.02  0.06 -0.08 -0.31  0.23  0.12        0.51 -0.06  0.21 -1.09  0.27
    hg   0.13 -0.02  0.03  0.09  0.15  0.33  0.43       -0.02  0.24 -1.53 -0.12
    ekg  0.20 -0.38  0.10  0.42  0.12  0.41 -0.04 -0.04        0.15 -0.42 -1.23
    pf   0.04  0.08  0.02  0.36  0.14 -0.03  0.22  0.29  0.13       -1.75 -0.46
    bm  -0.02 -0.03 -0.13  0.00  0.00  0.03 -0.04 -0.06 -0.01 -0.06       -0.02
    hx   0.04  0.05 -0.01 -0.04  0.00 -0.06  0.02 -0.01 -0.09 -0.04 -0.05      
    
    Summary of imputed values
    
    sz 
           n  missing distinct     Info     Mean      Gmd 
           5        0        4     0.95    12.86    10.31 
                                          
    Value       6.000  7.416 20.180 24.690
    Frequency       2      1      1      1
    Proportion    0.4    0.2    0.2    0.2
    sg 
           n  missing distinct     Info     Mean      Gmd      .05      .10 
          11        0       10    0.995     10.1    3.205    6.900    7.289 
         .25      .50      .75      .90      .95 
       7.697   10.270   10.560   15.000   15.000 
    
    lowest :  6.511  7.289  7.394  8.000 10.250, highest: 10.270 10.320 10.390 10.730 15.000
                                                                             
    Value       6.511  7.289  7.394  8.000 10.250 10.270 10.320 10.390 10.730
    Frequency       1      1      1      1      1      1      1      1      1
    Proportion  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091
                     
    Value      15.000
    Frequency       2
    Proportion  0.182
    age 
           n  missing distinct     Info     Mean      Gmd 
           1        0        1        0    71.65       NA 
                    
    Value      71.65
    Frequency      1
    Proportion     1
    wt 
           n  missing distinct     Info     Mean      Gmd 
           2        0        2        1    97.77    13.06 
                            
    Value       91.24 104.30
    Frequency       1      1
    Proportion    0.5    0.5
    ekg 
           n  missing distinct     Info     Mean      Gmd 
           8        0        4    0.905    2.625     1.75 
                                      
    Value          1     3     4     5
    Frequency      3     3     1     1
    Proportion 0.375 0.375 0.125 0.125
    
    Starting estimates for imputed values:
    
      sz   sg   ap  sbp  dbp  age   wt   hg  ekg   pf   bm   hx 
    11.0 10.0  0.7 14.0  8.0 73.0 98.0 13.7  1.0  1.0  0.0  0.0 
    Code
    ggplot(ptrans, scale=TRUE) +
      theme(axis.text.x=element_text(size=6))

    Figure 8.3: Simultaneous transformation and single imputation of all candidate predictors using transcan. Imputed values are shown as red plus signs. Transformed values are arbitrarily scaled to \([0,1]\).

    Note that at face value the transformation of ap was derived in a circular manner, since the combined index of stage and histologic grade, sg, uses in its stage component a cutoff on ap. However, if sg is omitted from consideration, the resulting transformation for ap does not change appreciably. Note that bm and hx are represented as binary variables, so their coefficients in the table of canonical variable coefficients are on a different scale. For the variables that were actually transformed, the coefficients are for standardized transformed variables (mean 0, variance 1). From examining the \(R^2\)s, age, wt, ekg, pf, and hx are not strongly related to other variables. Imputations for age, wt, ekg thus rely more on the median or modal values from the marginal distributions. From the coefficients of the first (standardized) canonical variates, sbp is predicted almost solely from dbp; bm is predicted mainly from ap, hg, and pf. 2

  • 2 Schemper & Heinze (1997) used logistic models to impute dichotomizations of the predictors for this dataset.

  • 8.6 Data Reduction Using Principal Components

    The first PC, PC\(_{1}\), is the linear combination of standardized variables having maximum variance. PC\(_{2}\) is the linear combination of predictors having the second largest variance such that PC\(_{2}\) is orthogonal to (uncorrelated with) PC\(_{1}\). If there are \(p\) raw variables, the first \(k\) PCs, where \(k < p\), will explain only part of the variation in the whole system of \(p\) variables unless one or more of the original variables is exactly a linear combination of the remaining variables. Note that it is common to scale and center variables to have mean zero and variance 1 before computing PCs.

    The response variable (here, time until death due to any cause) is not examined during data reduction, so that if PCs are selected by variance explained in the \(X\)-space and not by variation explained in \(Y\), one needn’t correct for model uncertainty or multiple comparisons.

    PCA results in data reduction when the analyst uses only a subset of the \(p\) possible PCs in predicting \(Y\). This is called incomplete principal component regression. When one sequentially enters PCs into a predictive model in a strict pre-specified order (i.e., by descending amounts of variance explained for the system of \(p\) variables), model uncertainty requiring bootstrap adjustment is minimized. In contrast, model uncertainty associated with stepwise regression (driven by associations with \(Y\)) is massive.
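A minimal self-contained sketch of incomplete principal component regression on simulated data (the names here are illustrative; the prostate analysis below uses princomp rather than prcomp): the PCs are computed from \(X\) alone, ordered by variance explained, and only the first \(k\) score vectors enter the model for \(y\).

```r
set.seed(2)
n <- 200
z <- rnorm(n)                                  # latent driver of x1, x2, and y
X <- cbind(x1 = z + rnorm(n, sd=0.3),
           x2 = z + rnorm(n, sd=0.3),
           x3 = rnorm(n))
y <- z + rnorm(n)
p <- prcomp(X, center=TRUE, scale.=TRUE)       # y is never examined here
round(p$sdev^2 / sum(p$sdev^2), 2)             # variance explained, descending
f <- lm(y ~ p$x[, 1:2])                        # incomplete PC regression, k = 2
```

The PC scores in `p$x` are mutually uncorrelated, so adding components one at a time in this fixed order does not disturb earlier coefficients the way correlated raw predictors would.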

    For the prostate dataset, consider PCs on raw candidate predictors, expanding polytomous factors using dummy variables. The R function princomp is used, after singly imputing missing raw values using transcan’s optimal additive nonlinear models. In this series of analyses we ignore the treatment variable, rx.

    Code
    spar(top=1)
    # Impute all missing values in all variables given to transcan
    imputed <- impute(ptrans, data=prostate, list.out=TRUE)
    
    
    Imputed missing values with the following frequencies
     and stored them in variables with their original names:
    
     sz  sg age  wt ekg 
      5  11   1   2   8 
    Code
    imputed <- as.data.frame(imputed)
    
    # Compute principal components on imputed data.
    # Create a design matrix from ekg categories
    Ekg <- model.matrix(~ ekg, data=imputed)[, -1]
    # Use correlation matrix
    pfn <- prostate$pfn
    prin.raw <- princomp(~ sz + sg + ap + sbp + dbp + age +
                         wt + hg + Ekg + pfn + bm + hx,
                         cor=TRUE, data=imputed)
    
    plot(prin.raw, type='lines', main='', ylim=c(0,3))
    # Add cumulative fraction of variance explained
    addscree <- function(x, npcs=min(10, length(x$sdev)),
                         plotv=FALSE,
                         col=1, offset=.8, adj=0, pr=FALSE) {
      vars <- x$sdev^2
      cumv <- cumsum(vars)/sum(vars)
      if(pr) print(cumv)
      text(1:npcs, vars[1:npcs] + offset*par('cxy')[2],
           as.character(round(cumv[1:npcs], 2)),
           srt=45, adj=adj, cex=.65, xpd=NA, col=col)
      if(plotv) lines(1:npcs, vars[1:npcs], type='b', col=col)
    }
    addscree(prin.raw)
    prin.trans <- princomp(ptrans$transformed, cor=TRUE)
    addscree(prin.trans, npcs=10, plotv=TRUE, col='red',
             offset=-.8, adj=1)

    Figure 8.4: Variance of the system of raw predictors (black) explained by individual principal components (lines) along with cumulative proportion of variance explained (text), and variance explained by components computed on transcan-transformed variables (red)

    The resulting plot shown in Figure 8.4 is called a “scree” plot [Jolliffe (2010); pp. 96–99, 104, 106]. It shows the variation explained by the first \(k\) principal components as \(k\) increases all the way to 16 parameters (no data reduction). It requires 10 of the 16 possible components to explain \(> 0.8\) of the variance, and the first 5 components explain \(0.49\) of the variance of the system. Two of the 16 dimensions are almost totally redundant.

    After repeating this process when transforming all predictors via transcan, we have only 12 degrees of freedom for the 12 predictors. The variance explained is depicted in Figure 8.4 in red. It requires at least 9 of the 12 possible components to explain \(\geq 0.9\) of the variance, and the first 5 components explain \(0.66\) of the variance as opposed to \(0.49\) for untransformed variables.

    Let us see how the PCs “explain” the times until death using the Cox regression (Cox, 1972) function from rms, cph, described in Chapter 20. In what follows we vary the number of components used in the Cox models from 1 to all 16, computing the AIC for each model. AIC is related to model log likelihood penalized for number of parameters estimated, and lower is better. For reference, the AIC of the model using all of the original predictors, and the AIC of a full additive spline model are shown as horizontal lines.
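The AIC used here is \(-2\) times the maximized log likelihood plus twice the number of estimated parameters; a quick self-contained check with lm on simulated data (the code below applies the same AIC generic to cph fits):

```r
set.seed(3)
x <- rnorm(50)
y <- x + rnorm(50)
f <- lm(y ~ x)
ll <- logLik(f)
manual <- -2 * as.numeric(ll) + 2 * attr(ll, 'df')
c(AIC = AIC(f), manual = manual)   # the two values agree
```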

    Code
    require(rms)
    spar(bty='l')
    S <- with(prostate, Surv(dtime, status != "alive"))
    # two-column response var.
    
    pcs <- prin.raw$scores         # pick off all PCs
    aic <- numeric(16)
    for(i in 1:16) {
      ps <- pcs[,1:i]
      aic[i] <- AIC(cph(S ~ ps))
    }
    plot(1:16, aic, xlab='Number of Components Used',
         ylab='AIC', type='l', ylim=c(3950,4000))
    f <- cph(S ~ sz + sg + log(ap) + sbp + dbp + age + wt + hg +
             ekg + pf + bm + hx, data=imputed)
    abline(h=AIC(f), col='blue')
    ## The following model in the 2nd edition no longer converges
    # f <- cph(S ~ rcs(sz,5) + rcs(sg,5) + rcs(log(ap),5) +
    #          rcs(sbp,5) + rcs(dbp,5) + rcs(age,3) + rcs(wt,5) +
    #          rcs(hg,5) + ekg + pf + bm + hx,
    #          tol=1e-14, data=imputed)
    f <- cph(S ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),5) +
             rcs(sbp,4) + rcs(dbp,4) + rcs(age,3) + rcs(wt,4) +
             rcs(hg,4) + ekg + pf + bm + hx,
             tol=1e-14, data=imputed)
    abline(h=AIC(f), col='blue', lty=2)

    Figure 8.5: AIC of Cox models fitted with progressively more principal components. The solid blue line depicts the AIC of the model with all original covariates. The dotted blue line is positioned at the AIC of the full spline model.

    For the money, the first 5 components adequately summarize all variables, if linearly transformed, and the full linear model is no better than this. The model allowing all continuous predictors to be nonlinear is not worth its added degrees of freedom.

    Next check the performance of a model derived from cluster scores of transformed variables.

    Code
    # Compute PC1 on a subset of transcan-transformed predictors
    pco <- function(v) {
      f <- princomp(ptrans$transformed[,v], cor=TRUE)
      vars <- f$sdev^2
      cat('Fraction of variance explained by PC1:',
          round(vars[1]/sum(vars),2), '\n')
      f$scores[,1]
    }
    tumor   <- pco(c('sz','sg','ap','bm'))
    Fraction of variance explained by PC1: 0.59 
    Code
    bp      <- pco(c('sbp','dbp'))
    Fraction of variance explained by PC1: 0.84 
    Code
    cardiac <- pco(c('hx','ekg'))
    Fraction of variance explained by PC1: 0.61 
    Code
    # Get transformed individual variables that are not clustered
    other   <- ptrans$transformed[,c('hg','age','pf','wt')]
    f <- cph(S ~ tumor + bp + cardiac + other)  # other is matrix
    AIC(f)
    [1] 3954.393
    Code
    print(f, latex=TRUE, long=FALSE, title='')
                Model Tests        Discrimination Indexes
    Obs    502  LR χ2      81.11   R2            0.149
    Events 354  d.f.           7   R2(7,502)     0.137
    Center   0  Pr(>χ2)   0.0000   R2(7,354)     0.189
                Score χ2   86.81   Dxy           0.286
                Pr(>χ2)   0.0000

             β        S.E.    Wald Z  Pr(>|Z|)
    tumor    -0.1723  0.0367  -4.69   <0.0001
    bp        0.0251  0.0424   0.59    0.5528
    cardiac   0.2513  0.0516   4.87   <0.0001
    hg       -0.1407  0.0554  -2.54    0.0111
    age      -0.1034  0.0579  -1.79    0.0739
    pf       -0.0933  0.0487  -1.92    0.0551
    wt       -0.0910  0.0555  -1.64    0.1012

    The tumor and cardiac clusters seem to dominate prediction of mortality, and the AIC of the model built from cluster scores of transformed variables compares favorably with other models (Figure 8.5).

    8.6.1 Sparse Principal Components

    A disadvantage of principal components is that every predictor receives a nonzero weight for every component, so many coefficients are involved even though the effective degrees of freedom with respect to the response model are reduced. Sparse principal components (Witten & Tibshirani, 2008) uses a penalty function to reduce the magnitude of the loadings that variables receive in the components. If an L1 penalty is used (as with the lasso), some loadings are shrunk to zero, resulting in some simplicity. Sparse principal components combines some elements of variable clustering, scoring of variables within clusters, and redundancy analysis.
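The qualitative effect of an L1 penalty on loadings can be sketched with the soft-thresholding operator (a generic lasso-style illustration, not the actual algorithm used by pcaPP below):

```r
# Soft-thresholding: shrink toward zero, setting small values exactly to zero
soft <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)
loadings <- c(0.62, 0.25, -0.08, 0.03, -0.39)   # hypothetical loadings
round(soft(loadings, lambda=0.10), 2)           # -> 0.52 0.15 0.00 0.00 -0.29
```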

    Filzmoser et al. (2012) have written a nice R package pcaPP for doing sparse PC analysis.3 The following example uses the prostate data again. To allow for nonlinear transformations and to score the ekg variable in the prostate dataset down to a scalar, we use the transcan-transformed predictors as inputs.

  • 3 The spca package is a new sparse PC package that should also be considered.

  • Code
    require(pcaPP)
    spar()
    s <- sPCAgrid(ptrans$transformed, k=10, method='sd',
                  center=mean, scale=sd, scores=TRUE,
                  maxiter=10)
    plot(s, type='lines', main='', ylim=c(0,3))
    addscree(s)
    lo <- unclass(s$loadings)
    flo <- format(round(lo, 2), zero.print=FALSE)
    print(flo, quote=FALSE)    # These loadings are on the orig. transcan scale
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
    sz   0.25                                                    0.95   0.04  
    sg   0.62                                                   -0.07   0.52  
    ap   0.63                                                   -0.30         
    sbp        -0.71                                                          
    dbp         0.71                                                          
    age                       1.00                                            
    wt                                             1.00                       
    hg                                      1.00                              
    ekg                1.00                                                   
    pf                                                    1.00                
    bm  -0.39                                                           0.85  
    hx                               1.00                                     

    Figure 8.6: Variance explained by individual sparse principal components (lines) along with cumulative proportion of variance explained (text)

    Only nonzero loadings are shown. The first sparse PC is the tumor cluster used above, and the second is the blood pressure cluster. Let us see how well incomplete sparse principal component regression predicts time until death.

    Code
    spar(bty='l')
    pcs <- s$scores         # pick off sparse PCs
    aic <- numeric(10)
    for(i in 1:10) {
      ps <- pcs[,1:i]
      aic[i] <- AIC(cph(S ~ ps))
    }
    plot(1:10, aic, xlab='Number of Components Used',
         ylab='AIC', type='l',  ylim=c(3950,4000))

    Figure 8.7: Performance of sparse principal components in Cox models

    More components are required to optimize AIC than were seen in Figure 8.5, but a model built from 6–8 sparse PCs performed as well as the other models.

    8.7 Transformation Using Nonparametric Smoothers

    The ACE nonparametric additive regression method of Breiman & Friedman (1985) transforms both the left-hand-side variable and all the right-hand-side variables so as to optimize \(R^{2}\). ACE can be used to transform the predictors using the R ace function in the acepack package, called by the transace function in the Hmisc package. transace does not impute data but merely does casewise deletion of missing values. Here transace is run after single imputation by transcan. binary is used to tell transace which variables not to try to predict (because they need no transformation). Several predictors are restricted to be monotonically transformed.

    Code
    spar(mfrow=c(4,3), left=-2)
    x <- with(imputed,
              cbind(sz, sg, ap, sbp, dbp, age, wt, hg, ekg, pf,
                    bm, hx))
    monotonic <- c("sz","sg","ap","sbp","dbp","age","pf")
    transace(x, monotonic,
             categorical="ekg", binary=c("bm","hx"))
    R-squared achieved in predicting each variable:
    
           sz        sg        ap       sbp       dbp       age        wt        hg 
    0.2279024 0.5748361 0.5728639 0.4823925 0.4579628 0.1514941 0.1718541 0.2020425 
          ekg        pf        bm        hx 
    0.1113341 0.1774896        NA        NA 

    Figure 8.8: Simultaneous transformation of all variables using ACE.

    Except for ekg and age, and allowing for arbitrary sign reversals, the transformations in Figure 8.8 determined using transace were similar to those in Figure 8.3. The transcan transformation for ekg makes more sense.