13 Analysis of Covariance in Randomized Studies
Introduction
 Fundamental clinical question: if I were to give this patient treatment B instead of treatment A, by how much should I expect the clinical outcome to be better?
 A goldstandard design is a 6period 2treatment randomized crossover study; the patient actually receives both treatments and her responses can be compared
 Using each patient as her own control
 Studies of identical twins come somewhat close to that
 Short of these designs, the best we can do is to ask this fundamental question:
If two patients differ at baseline (with respect to available measurements, and on the average with respect to unmeasured covariates) only with regard to assigned treatment, how do their expected outcomes differ?  Covariate adjustment does that
 Analysis of covariance is classical terminology from linear models but we often use the term also for nonlinear models
 Covariate adjustment means to condition on important facts known about the patients before treatment is given
 We treat baseline values as known constants and do not model their distributions
 Doing a comparison of outcomes that is equalized on baseline characteristics (within treatment group) accounts for outcome heterogeneity within treatment group
 Accounting for easily accountable (baseline measurements) outcome heterogeneity either
 maximizes power and precision (in linear models, by reducing residual variance), or
 gets the model “right” which prevents power loss caused by unexplained outcome heterogeneity (in nonlinear models, by making the model much more correct than assuming identical outcome tendencies for all subjects in a treatment group)
The model to use depends on the nature of the outcome variable Y, and different models have different effect measures:
 Continuous Y: we often use difference in means; 2sample \(t\)test assumes all observations in a treatment group have the same mean
 analyzing heterogeneous \(Y\) as if homogeneous (using \(t\)test instead of ANCOVA) lowers power and precision by expanding \(\sigma^2\), but does not bias the difference in means
 Binary Y: classic \(2\times 2\) table assumes every person on treatment T has the same probability of outcome
 risk difference is only a useful measure in this homogeneous case
 in heterogeneous case, risk difference will vary greatly, e.g., sicker patients have more “room to move”
 without covariate adjustment, odds ratios only apply to the homogenous case; in heterogeneous case the unadjusted odds ratio may apply to no one (see fharrell.com/post/marg)
 Time to event: Cox proportional hazards model and its special case the logrank test
without covariate adjustment assumes homogeneity within treatment group
unadjusted hazard ratio does not estimate a useful quantity if there is heterogeneity
nonproportional hazards is more likely without covariate adjustment than with it
 without adjustment the failure time distribution is really a mixture of multiple distributions
Hierarchy of Causal Inference for Treatment Efficacy
Let \(P_{i}\) denote patient \(i\) and the treatments be denoted by \(A\) and \(B\). Thus \(P_{2}^{B}\) represents patient 2 on treatment \(B\). \(\overline{P}_{1}\) represents the average outcome over a sample of patients from which patient 1 was selected.
Design  Patients Compared 

6period crossover  \(P_{1}^{A}\) vs \(P_{1}^{B}\) (directly measure HTE) 
2period crossover  \(P_{1}^{A}\) vs \(P_{1}^{B}\) 
RCT in idential twins  \(P_{1}^{A}\) vs \(P_{1}^{B}\) 
\(\parallel\) group RCT  \(\overline{P}_{1}^{A}\) vs \(\overline{P}_{2}^{B}\), \(P_{1}=P_{2}\) on avg 
Observational, good artificial control  \(\overline{P}_{1}^{A}\) vs \(\overline{P}_{2}^{B}\), \(P_{1}=P_{2}\) hopefully on avg 
Observational, poor artificial control  \(\overline{P}_{1}^{A}\) vs \(\overline{P}_{2}^{B}\), \(P_{1}\neq P_{2}\) on avg 
Realworld physician practice  \(P_{1}^{A}\) vs \(P_{2}^{B}\) 
13.1 Covariable Adjustment in Linear Models
If you fail to adjust for prespecified covariates, the statistical model’s residuals get larger. Not good for power, but incorporates the uncertainties needed for any possible random baseline imbalances.
— F Harrell 2019
 Model: \(E(Y  X) = X \beta + \epsilon\)
 Continuous response variable \(Y\), normal residuals
 Statistical testing for baseline differences is scientifically incorrect (Altman & Doré 1990, Begg 1990, Senn 1994, Austin et al 2010); as Bland & Altman (2011) stated, statistical tests draw inferences about populations, and the population model here would involve a repeat of the randomization to the whole population hence balance would be perfect. Therefore the null hypothesis of no difference in baseline distributions between treatment groups is automatically true.
 If we are worried about baseline imbalance we need to search patient records for counter– balancing factors
 ⇒ imbalance is not the reason to adjust for covariables
 Adjust to gain efficiency by subtracting explained variation
 Relative efficiency of unadjusted treatment comparison is \(1\rho^2\)
 Unadjusted analyses yields unbiased treatment effect estimate
See here for a detailed discussion, including reasons not to even stratify by treatment in “Table 1.”
13.2 Covariable Adjustment in Nonlinear Models
13.2.2 Models for Binary Response
 Model for probability of event must be nonlinear in predictors unless risk range is tiny
 Useful summary of relative treatment effect is the odds ratio (OR)
 Use of binary logistic model for covariable adjustment will result in an {} in the S.E. of the treatment effect (log odds ratio) (Robinson & Jewell (1991))
 But even with perfect balance, adjusted OR \(\neq\) unadjusted OR
 Adjusted OR will be greater than unadjusted OR
Example from GUSTO–I
D
 Steyerberg et al. (2000)
 Endpoint: 30–day mortality (0.07 overall)
 10,348 patients given accelerated \(t\)–PA
 20,162 patients given streptokinase (SK)
 Means and Percentages
Baseline Characteristic  \(t\)–PA  SK 

Age  61.0  60.9 
Female  25.3  25.3 
Weight  79.6  79.4 
Height  171.1  171.0 
Hypertension  38.2  38.1 
Diabetes  14.5  15.1 
Never smoked  29.8  29.6 
High cholesterol  34.6  34.3 
Previous MI  16.9  16.5 
Hypotension  8.0  8.3 
Tachycardia  32.5  32.7 
Anterior MI  38.9  38.9 
Killip class I  85.0  85.4 
ST elevation  37.3  37.8 
Unadjusted / Adj. Logistic Estimates
 With and without adjusting for 17 baseline characteristics
Type of Analysis  Log OR  S.E.  \(\chi^2\) 

Unadjusted  0.159  0.049  10.8 
Adjusted  0.198  0.053  14.0 
 Percent reduction in odds of death: 15% vs. 18
 0.159 (15%) is a biased estimate
 Increase in S.E. more than offset by increase in treatment effect
 Adjusted comparison based on 19% fewer patients would have given same power as unadjusted test
Code
load('gustomin.rda')
with(gustomin,
plot(density(p.sk), xlim=c(0, .4), xlab='Baseline Expected Risk',
ylab='Probability Density', main='') )
 See this for a more indepth analysis
 Robinson & Jewell: “It is always more efficient to adjust for predictive covariates when logistic models are used, and thus in this regard the behavior of logistic regression is the same as that of classic linear regression.”
Simple Logistic Example – Gail 1986
Males  A  B  Total  

Y=0  500  900  1400  
Y=1  500  100  600  
Total  1000  1000  2000  OR=9 
Females  A  B  Total  

Y=0  100  500  600  
Y=1  900  500  1400  
Total  1000  1000  2000  OR=9 
Pooled  A  B  Total  

Y=0  600  1400  2000  
Y=1  1400  600  2000  
Total  2000  2000  4000  OR=5.44 
From seeing this example one can argue that odds ratios, like hazard ratios, were never really designed to be computed on a set of subjects having heterogeneity in their expected outcomes. See fharrell.com/post/marg for more.
13.2.3 Nonlinear Models, General
G
 M. H. Gail et al. (1984) showed that for unadjusted treatment estimates to be unbiased, regression must be linear or exponential
 Mitchell H. Gail (1986) showed that for logistic, Cox, and paired survival models unadjusted treatment effects are asymptotically biased low in absolute value
 Gail also studied normal, exponential, additive risk, Poisson
Part of the problem with Poisson, proportional hazard and logistic regression approaches is that they use a single parameter, the linear predictor, with no equivalent of the variance parameter in the Normal case. This means that lack of fit impacts on the estimate of the predictor.
— Senn (2004), p. 3747
13.3 Cox / Log–Rank Test for Time to Event
 Lagakos & Schoenfeld 1984 showed that type I probability is preserved if don’t adjust
 If hazards are proportional conditional on covariables, they are not proportional if omit covariables
 Morgan 1986 derived asymptotic relative efficiencies (ARE) of unadjusted log–rank test if a binary covariable is omitted
 If prevalence of covariable \(X\) is 0.5:
\(X=1 : X=0\) Hazard Ratio  ARE 

1.0  1.00 
1.5  0.95 
2.0  0.88 
3.0  0.72 
 Ford et al. (1995): Treatment effect does not have the same interpretation under unadjusted and adjusted models
 No reason for the two hazard ratios to have the same value
 Akazawa et al. (1997): Power of unadjusted and stratified log–rank test
Number  Range of  Power  Power 

of Strata  Log Hazards  Unadj.  Adjusted 
1  0  .78  – 
2  0–0.5  .77  .78 
0–1  .67  .78  
0–2  .36  .77  
4  0–3  .35  .77 
8  0–3.5  .33  .77 
13.3.1 Sample Size Calculation Issues
 Schoenfeld (1983) implies that covariable adjustment can only \(\uparrow\) sample size in randomized trials
 Need to recognize ill–definition of unadjusted hazard ratios
13.4 Why are Adjusted Estimates Right?
 Hauck et al. (1998), who have an excellent review of covariable adjustment in nonlinear models, state:
For use in a clinician–patient context, there is only a single person, that patient, of interest. The subjectspecific measure then best reflects the risks or benefits for that patient. Gail has noted this previously [ENAR Presidential Invited Address, April 1990], arguing that one goal of a clinical trial ought to be to predict the direction and size of a treatment benefit for a patient with specific covariate values. In contrast, population–averaged estimates of treatment effect compare outcomes in groups of patients. The groups being compared are determined by whatever covariates are included in the model. The treatment effect is then a comparison of average outcomes, where the averaging is over all omitted covariates.
— Hauck et al. (1998)
13.5 How Many Covariables to Use?
 Try to adjust for the bulk of the variation in outcome (Hauck et al. (1998),Tukey (1993))
 Neuhaus (1998): “to improve the efficiency of estimated covariate effects of interest, analysts of randomized clinical trial data should adjust for covariates that are strongly associated with the outcome”
 Raab et al. (2004) have more guidance for choosing covariables and provide a formula for linear model that shows how the value of adding a covariable depends on the sample size
13.6 Differential and Absolute Treatment Effects
13.6.1 Modeling Differential Treatment Effect
Differential treatment effect is often called heterogeneity of treatment effect or HTE. As opposed to the natural expansion of absolute treatment effect with underlying subject risk, differential treatment effect is usually based on analyses of relative effects, especially when the outcome is binary.
The most common approach to analyzing differential treatment effect involves searching for such effects rather than estimating the differential effect. This is, tragically, most often done through subgroup analysis.
Problems With Subgroup Analysis
 Subgroup analysis is widely practiced and widely derided in RCTs (as well as in observational studies)
 Construction of a subgroup from underlying factors that are continuous in nature (e.g., “older” = age \(\geq 65\)) assumes that the treatment effect is like falling off a cliff, i.e., allornothing. Discontinuous treatment effects, like discontinuous main effects, have not been found and validated but have always been shown to have been an artificial construct driven by opinion and not data.
 Given a subgroup a simple label such as “class IV heart failure” may seem to be meaningful but subgrouping carries along other subject characteristics that are correlated with the subgrouping variable. So the subgroup’s treatment effect estimate has a more complex interpretation than what is thought.
 Researchers who don’t understand that “absence of evidence is not evidence for absence” interpret a significant effect in one subgroup but not in another as the treatment being effective in the first but ineffective in the second. The second \(P\)value may be large because of increased variability in the second subgroup or because of a smaller effective sample size in that group.
 Testing the treatment effect in a subgroup does not carry along with it the covariate adjustment needed and assumes there is no residual HTE within the subgroup.
 When treatment interacts with factors not used in forming the subgroup, or when the subgrouping variable interacts with an ignored patient characteristic, the subgroup treatment effect may be grossly misleading. As an example, in GUSTOI there was a strong interaction between Killip class and age. Unadjusted analysis of treatment effect within older subjects was partially an analysis of Killip class. And the treatment effect was not “significant” in older patients, leading many readers to conclude \(t\)–PA should not be given to them. In fact there was no evidence that age interacted with treatment, and the absolute risk reduction due to \(t\)–PA increased with age.
Specifying Interactions
N
 Assessing differential treatment effect best done with formal interaction tests rather than subgroup analysis
 Pre–specify sensible effect modifiers
 interactions between treatment and extent of disease
 “learned” interventions: interaction between treatment and duration of use by physician
 Interactions with center are not necessarily sensible
 Need to use penalized estimation (e.g., interaction effects as random effects) to get sufficient precision of differential treatment effects, if # interaction d.f. \(> 4\) for example Sargent & Hodges (1996) and Yamaguchi & Ohashi (1999)
Strategy for Analyzing Differential Treatment Effect
The broad strategy that is recommended is based on the following:
 Anything resembling subgroup analysis should be avoided
 Anything that assumes that the treatment effect has a discontinuity in a continuous variable should be avoided. Differential effects should be smooth doseresponse effects.
 Anything that results in a finding that has a simpler alternate explanation should be avoided
 Specify models so that an apparent interaction is not just a standin for an omitted main effect
Particulars of the strategy are:
 Formulate a subjectmatter driven covariate model, including all covariates understood to have strong effects on the outcome. Ensure that these covariates are adjusted for in every HTE analysis context.
 Main effects need to be flexibly modeled (e.g., using regression splines) so as to not assume linearity. False linearity assumptions can inflate apparent interaction effects because the interaction may be colinear with omitted nonlinear main effects.
 If the sample size allows, also model interaction effects as smooth nonlinear functions. As with main effects, it is not uncommon for interaction effects to be nonlinear, e.g., the effect of treatment is small for age \(<70\) then starts to expand rapidly after age 70^{1}. As a compromise, force interactions to be linear even if main effects are not, if the effective sample size does not permit estimating parameters for nonlinear differential effects.
 Consider effect modifiers that are somewhat uncorrelated with each other. Add the main effects corresponding to each potential effect modifier into the model described in the previous point. For example if the primary analysis uses a model containing age, sex, and severity of disease and one is interested in assessing HTE by race and by geographical region, add both race and region to the main effects that are adjusted for in all later models. This will result in proper covariate adjustment and will handle the case where an apparent interaction is partially explained by an omitted main effect that is colinear with one of the interacting factors.
 Carry out a joint (chunk) likelihood ratio or \(F\)test for all potential interaction effects combined. This test has a perfect multiplicity adjustment and will not be “brought down” by the potential interactions being colinear with each other. The \(P\)value from this joint test with multiple degrees of freedom will place the tests described in the next step in context. But note that with many d.f. the test may lack power.
 Consider each potential interaction oneatatime by adding that interaction term to the comprehensive main effects model. In the example given above, main effects simultaneously adjusted for would include treatment, age, sex, severity of disease, race, and region. Then treatment \(\times\) race is added to the model and tested. Then the race interaction term (but not its main effect) is removed from the model and is replaced by region interactions.
^{1} This does not imply that age should be modeled as a categorical variable to estimate the interaction; that would result in unexplained HTE in the age \(\geq 70\) interval.
Concerning the last step, we must really recognize that there are two purposes for the analysis of differential treatment effect:
 Understanding which subject characteristics are associated with alterations in efficacy
 Predicting the efficacy for an individual subject For the latter purpose, one might include all potential treatment interactions in the statistical model, whether interactions for multiple baseline variables are colinear or not. Such colinearities do not hurt predictions or confidence intervals for predictions. For the former purpose, interactions can make one effect modifier compete with another, and neither modifier will be significant when adjusted for the other. As an example, suppose that the model had main effects for treatment, age, sex, and systolic and diastolic blood pressure (SBP, DBP) and we entertained effect modification due to one or both of SBP, DBP. Because of the correlation between SBP and DBP, the main effects for these variables are colinear and their statistical significance is weakened because they estimate, for example, the effect of increasing SBP holding DBP constant (which is difficult to do). Likewise, treatment \(\times\) SBP and treatment \(\times\) DBP interactions are colinear with each other, making statistical tests for interaction have low power. We may find that neither SBP nor DBP interaction is “significant” after adjusting for the other, but that the combined chunk test for SBP or DBP interaction is highly significant. Someone who does not perform the chunk test may falsely conclude that SBP does not modify the treatment effect. This problem is more acute when more than two factors are allowed to interact with treatment.
Note that including interaction terms in the model makes all treatment effect estimates conditional on specific covariate values, effectively lowering the sample size for each treatment effect estimate. When there is no HTE, the overall treatment main effect without having interaction terms in the model is by far the highest precision estimate. There is a biasvariance tradeoff when considering HTE. Adding interaction terms lowers bias but greatly increases variance.
Another way to state the HTE estimation strategy is as follows.
 Pick a model for which it is mathematically possible that there be no interactions (no restrictions on model parameters)
 Develop a model with ample flexiblymodeled main effects and clinically prespecified interactions. Use a Bayesian skeptical prior or penalized maximum likelihood estimate to shrink the interaction terms if the effective sample size does not support the prespecified number of interaction parameters. Be sure to model interaction effects as continuous if they come from variables of an underlying continuous nature.
 Display and explain relative effects (e.g., odds ratios) as a function of interacting factors and link that to what is known about the mechanism of treatment effect.
 Put all this in a context (like Figure Figure 13.7) that shows absolute treatment benefit (e.g., risk scale) as a function of background risk and of interacting factors.
 State clearly that background risk comes from all nontreatment factors and is a riskdifference accelerator that would increase risk differences for any risk factor, not just treatment. Possibly provide a risk calculator to estimate background risk to plug into the \(x\)axis.
Consider simulated twotreatment clinical trial data to illustrate the “always adjust for all main effects but consider interactions one at a time” approach to analyzing and displaying evidence for differential treatment effec. After that we will illustrate a simultaneous interaction analysis. Simulate timetoevent data from an exponential distribution, and fit Cox proportional hazards models to estimate the interaction between age and treatment and between sex and treatment. The true age effect is simulated as linear in the log hazard and the treatment effect on the log relative hazard scale is proportional to how far above 60 years is a patient’s age, with no treatment benefit before age 60. This is a nonsimple interaction that could be exactly modeled with a linear spline function, but assume that the analyst does not know the true form of the interaction so she allows for a more general smooth form using a restricted cubic spline function. The data are simulated so that there is no sex interaction.
Code
require(rms)
options(prType='html') # for cph print, anova
set.seed(1)
< 3000 # total of 3000 subjects
n < rnorm(n, 60, 12)
age label(age) < 'Age'
< factor(sample(c('Male', 'Female'), n, rep=TRUE))
sex < factor(sample(c('A', 'B'), n, rep=TRUE))
treat < 15 * runif(n) # censoring time
cens < 0.02 * exp(0.04 * (age  60) + 0.4 * (sex == 'Female') 
h 0.04 * (treat == 'B') * pmax(age  60, 0))
< log(runif(n)) / h
dt label(dt) < 'Time Until Death or Censoring'
< ifelse(dt <= cens, 1, 0)
e < pmin(dt, cens)
dt units(dt) < 'Year'
< datadist(age, sex, treat); options(datadist='dd')
dd < Surv(dt, e)
S < cph(S ~ sex + rcs(age, 4) * treat)
f f
Cox Proportional Hazards Model
cph(formula = S ~ sex + rcs(age, 4) * treat)
Model Tests 
Discrimination Indexes 


Obs 3000  LR χ^{2} 130.78  R^{2} 0.047 
Events 470  d.f. 8  R^{2}_{8,3000} 0.040 
Center 2.3448  Pr(>χ^{2}) 0.0000  R^{2}_{8,470} 0.230 
Score χ^{2} 156.15  D_{xy} 0.261  
Pr(>χ^{2}) 0.0000 
β  S.E.  Wald Z  Pr(>Z)  

sex=Male  0.3628  0.0936  3.88  0.0001 
age  0.0423  0.0283  1.50  0.1349 
age’  0.0132  0.0655  0.20  0.8404 
age’’  0.0456  0.2499  0.18  0.8553 
treat=B  1.6377  1.6902  0.97  0.3326 
age × treat=B  0.0335  0.0357  0.94  0.3479 
age’ × treat=B  0.0643  0.0892  0.72  0.4709 
age’’ × treat=B  0.4048  0.3606  1.12  0.2616 
Code
anova(f)
Wald Statistics for S


χ^{2}  d.f.  P  

sex  15.03  1  0.0001 
age (Factor+Higher Order Factors)  102.95  6  <0.0001 
All Interactions  19.92  3  0.0002 
Nonlinear (Factor+Higher Order Factors)  6.04  4  0.1961 
treat (Factor+Higher Order Factors)  27.80  4  <0.0001 
All Interactions  19.92  3  0.0002 
age × treat (Factor+Higher Order Factors)  19.92  3  0.0002 
Nonlinear  4.11  2  0.1284 
Nonlinear Interaction : f(A,B) vs. AB  4.11  2  0.1284 
TOTAL NONLINEAR  6.04  4  0.1961 
TOTAL NONLINEAR + INTERACTION  20.80  5  0.0009 
TOTAL  138.91  8  <0.0001 
The model fitted above allows for a general age \(\times\) treatment interaction. Let’s explore this interaction by plotting the age effect separately by treatment group, then plotting the treatment B:A hazard ratio as a function of age.
Refit the model allowing for a sex interaction (but not an age interaction) and display the results.
Code
< cph(S ~ sex * treat + rcs(age, 4))
g g
Cox Proportional Hazards Model
cph(formula = S ~ sex * treat + rcs(age, 4))
Model Tests 
Discrimination Indexes 


Obs 3000  LR χ^{2} 108.40  R^{2} 0.039 
Events 470  d.f. 6  R^{2}_{6,3000} 0.034 
Center 1.2565  Pr(>χ^{2}) 0.0000  R^{2}_{6,470} 0.196 
Score χ^{2} 110.68  D_{xy} 0.249  
Pr(>χ^{2}) 0.0000 
β  S.E.  Wald Z  Pr(>Z)  

sex=Male  0.3494  0.1234  2.83  0.0046 
treat=B  0.2748  0.1222  2.25  0.0246 
age  0.0228  0.0173  1.31  0.1890 
age’  0.0452  0.0435  1.04  0.2981 
age’’  0.2088  0.1739  1.20  0.2301 
sex=Male × treat=B  0.0526  0.1894  0.28  0.7813 
Code
anova(g)
Wald Statistics for S


χ^{2}  d.f.  P  

sex (Factor+Higher Order Factors)  15.87  2  0.0004 
All Interactions  0.08  1  0.7813 
treat (Factor+Higher Order Factors)  10.18  2  0.0062 
All Interactions  0.08  1  0.7813 
age  78.81  3  <0.0001 
Nonlinear  1.85  2  0.3966 
sex × treat (Factor+Higher Order Factors)  0.08  1  0.7813 
TOTAL NONLINEAR + INTERACTION  1.92  3  0.5888 
TOTAL  104.05  6  <0.0001 
Code
< contrast(g, list(treat='B', sex=levels(sex)), list(treat='A', sex=levels(sex)))
k < as.data.frame(k[Cs(sex,age,Contrast,Lower,Upper)])
k kabl(k)
sex  age  Contrast  Lower  Upper 

Female  59.7296  0.2748  0.5144  0.0352 
Male  59.7296  0.3274  0.6108  0.0439 
Code
ggplot(Predict(g, sex, treat))
ggplot(k, aes(y=exp(Contrast), x=sex)) + geom_point() +
scale_y_log10(breaks=c(.5, .6, .7, .8, .9, 1, 1.1)) +
geom_linerange(aes(ymin=exp(Lower), ymax=exp(Upper))) +
xlab('Sex') + ylab('B:A Hazard Ratio') + coord_flip()
The above analysis signified a treatment effect for both sex groups (both confidence limits exclude a hazard ratio of 1.0), whereas this should only be the case when age exceeds 60. No sex \(\times\) treatment interaction was indicated. Redo the previous analysis adjusting for an age \(\times\) treatment interaction while estimating the sex \(\times\) treatment interaction. For this purpose we must specify an age value when estimating the treatment effects by sex. Here we set age to 50 years. Were age and sex correlated, the joint analysis would have been harder to interpret.
Code
< cph(S ~ treat * (rcs(age, 4) + sex))
h anova(h)
Wald Statistics for S


χ^{2}  d.f.  P  

treat (Factor+Higher Order Factors)  27.85  5  <0.0001 
All Interactions  19.98  4  0.0005 
age (Factor+Higher Order Factors)  103.01  6  <0.0001 
All Interactions  19.91  3  0.0002 
Nonlinear (Factor+Higher Order Factors)  6.05  4  0.1956 
sex (Factor+Higher Order Factors)  15.08  2  0.0005 
All Interactions  0.06  1  0.8096 
treat × age (Factor+Higher Order Factors)  19.91  3  0.0002 
Nonlinear  4.10  2  0.1286 
Nonlinear Interaction : f(A,B) vs. AB  4.10  2  0.1286 
treat × sex (Factor+Higher Order Factors)  0.06  1  0.8096 
TOTAL NONLINEAR  6.05  4  0.1956 
TOTAL INTERACTION  19.98  4  0.0005 
TOTAL NONLINEAR + INTERACTION  20.86  6  0.0019 
TOTAL  138.58  9  <0.0001 
Code
ggplot(Predict(h, sex, treat, age=50))
< contrast(h, list(treat='B', sex=levels(sex), age=50),
k list(treat='A', sex=levels(sex), age=50))
< as.data.frame(k[Cs(sex,age,Contrast,Lower,Upper)])
k ggplot(k, aes(y=exp(Contrast), x=sex)) + geom_point() +
scale_y_log10(breaks=c(.5, .6, .7, .8, .9, 1, 1.1)) +
geom_linerange(aes(ymin=exp(Lower), ymax=exp(Upper))) +
xlab('Sex') + ylab('B:A Hazard Ratio') + coord_flip()
This is a more accurate representation of the true underlying model, and is easy to interpret because (1) age and sex are uncorrelated in the simulation model used, and (2) only two interactions were considered.
Absolute vs. Relative Treatment Effects Revisited
Statistical models are typically chosen so as to maximize the likelihood of the model fitting the data and processes that generated them. Even though one may be interested in absolute effects, models must usually be based on relative effects so that quantities such as estimated risk are in the legal \([0,1]\) range. It is not unreasonable to think of a relative effects model such as the logistic model as providing supporting evidence that an interacting effect causes a change in the treatment effect, whereas the necessary expansion of treatment effect for higher risk subjects, as detailed in the next section, is merely a mathematical necessity for how probabilities work. One could perhaps say that any factor that confers increased risk for a subject (up to the point of diminishing returns; see below) might cause an increase in the treatment effect, but this is a general phenomenon that is spread throughout all risk factors independently of how they may directly affect the treatment effect. Because of this, additive risk models are not recommended for modeling outcomes or estimating differential treatment effects on an absolute scale. This logic leads to the following recommended strategy:
 Base all inference and predicted risk estimation on a model most likely to fit the data (e.g., logistic risk model; Cox model). Choose the model so that it is possible that there may be no interactions, i.e., a model that does not place a restriction on regression coefficients.
 Prespecify sensible possible interactions as described earlier
 Use estimates from this model to estimate relative differential treatment effects, which should be smooth functions of continuous baseline variables
 Use the same model to estimate absolute risk differences (as in the next section) as a function both of interacting factors and of baseline variables in the model that do not interact with treatment (but will still expand the absolute treatment effect)
 Even better: since expansion of risk differences is a general function of overall risk and not of individual risk factors only display effects of individual risk factors that interact with treatment (interacting on the scale that would have allowed them not to interact—the relative scale such as log odds) and show the general risk difference expansion as in Figures Figure 13.6 (think of the “risk factor” as treatment) and Figure 13.9
To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. ‘Additive at the point of analysis but relevant at the point of application’ should be the motto.
— Stephen Senn
13.6.2 Estimating Absolute Treatment Effects
 Absolute efficacy measures:
 Risk difference (\(\delta\))
 number needed to treat (reciprocal of risk difference)
 Years of life saved
 Quality–adjusted life years saved
 Binary response, no interactions with treatment, risk for control patient \(P\):
\(\delta = P  \frac{P}{P + (1P)/OR}\)  \(\delta\) is dominated by \(P\)
Code
plot(0, 0, type="n", xlab="Risk for Subject Without Risk Factor",
ylab="Increase in Risk",
xlim=c(0,1), ylim=c(0,.6))
< 0
i < c(1.1,1.25,1.5,1.75,2,3,4,5,10)
or for(h in or) {
< i + 1
i < seq(.0001, .9999, length=200)
p < log(p/(1  p)) # same as qlogis(p)
logit < logit + log(h) # modify by odds ratio
logit < 1/(1 + exp(logit))# same as plogis(logit)
p2 < p2  p
d lines(p, d, lty=i)
< max(d)
maxd < p[d==maxd]
smax text(smax, maxd + .02, format(h), cex=.6)
}
If the outcome is such that \(Y=1\) implies a good outcome, Figure Figure 13.6 would be useful for estimating the absolute risk increase for a “good” treatment by selecting the one curve according to the odds ratio the treatment achieved in a multivariable risk model. This assumes that the treatment does not interact with any patient characteristic(s).
van Klaveren et al. (2015) described the importance of correctly modeling interactions when estimating absolute treatment benefit.
Now consider the case where \(Y=1\) is a bad outcome and \(Y=0\) is a good outcome, and there is differential relative treatment effect according to a truly binary patient characteristic \(X=0,1\). Suppose that treatment represents a new agent and the control group is patients on standard therapy. Suppose that the new treatment multiplies the odds of a bad outcome by 0.8 when \(X=0\) and by 0.6 when \(X=1\), and that the background risk that \(Y=1\) for patients on standard therapy ranges from 0.01 to 0.99. The background risk could come from one or more continuous variables or mixtures of continuous and categorical patient characteristics. The “main effect” of \(X\) must also be specified. We assume that \(X=0\) goes into the background risk and \(X=1\) increases the odds that \(Y=1\) by a factor of 1.4 for patients on standard therapy. All of this specifies a full probability model that can be evaluated to show the absolute risk reduction by the new treatment as a function of background risk and \(X\).
Code
require(Hmisc)
< expand.grid(X=0:1, brisk=seq(0.01, 0.99, length=150))
d < upData(d,
d risk.standard = plogis(qlogis(brisk) + log(1.4) * X),
risk.new = plogis(qlogis(brisk) + log(1.4) * X +
log(0.8) * (X == 0) +
log(0.6) * (X == 1)),
risk.diff = risk.standard  risk.new,
X = factor(X) )
Input object size: 19040 bytes; 2 variables 300 observations
Added variable risk.standard
Added variable risk.new
Added variable risk.diff
Modified variable X
New object size: 27152 bytes; 5 variables 300 observations
Code
ggplot(d, aes(x=risk.standard, y=risk.diff, color=X)) +
geom_line() +
xlab('Risk Under Standard Treatment') +
ylab('Absolute Risk Reduction With New Treatment') # (* `r ipacue()` *)
It is important to note that the magnification of absolute risk reduction by increasing background risk should not be labeled (or analyzed) by any one contributing risk factor. This is a generalized effect that comes solely from the restriction of probabilities to the \([0,1]\) range.
Absolute Treatment Effects for GUSTO–I
Y
 No evidence for interactions with treatment
 Misleading subgroup analysis showed that elderly patients not benefit from \(t\)–PA; result of strong age \(\times\) Killip class interaction
 Wide variation in absolute benefit of \(t\)–PA
Code
< with(gustomin, p.sk  p.tpa)
delta plot(density(delta), xlab='Mortality Difference',
ylab='Probability Density', main='')
< mean(delta)
m < par("usr")
u arrows(m, u[3], m, 0, length=.1, lwd=2)
text(m, 2, 'Mean', srt=45, adj=0)
< median(delta)
med arrows(med, u[3], med, 0, length=.1, lwd=2)
text(med, 2, 'Median', srt=45, adj=0)
 Overall mortality difference of 0.011 dominated by high–risk patients
Code
load('gusto.rda')
require(rms)
< datadist(gusto); options(datadist='dd')
dd < lrm(day30 ~ tx + age * Killip + pmin(sysbp, 120) +
f lsp(pulse, 50) + pmi + miloc, data=gusto)
cat('<br><small>')
Code
f
Logistic Regression Model
lrm(formula = day30 ~ tx + age * Killip + pmin(sysbp, 120) + lsp(pulse, 50) + pmi + miloc, data = gusto)
Model Likelihood Ratio Test 
Discrimination Indexes 
Rank Discrim. Indexes 


Obs 40830  LR χ^{2} 4173.41  R^{2} 0.245  C 0.821 
0 37979  d.f. 15  R^{2}_{15,40830} 0.097  D_{xy} 0.642 
1 2851  Pr(>χ^{2}) <0.0001  R^{2}_{15,7955.8} 0.407  γ 0.642 
max ∂log L/∂β 5×10^{6}  Brier 0.055  τ_{a} 0.083 
β  S.E.  Wald Z  Pr(>Z)  

Intercept  3.9541  0.7135  5.54  <0.0001 
tx=SK  0.0738  0.0512  1.44  0.1499 
tx=tPA  0.1338  0.0608  2.20  0.0276 
age  0.0867  0.0026  33.67  <0.0001 
Killip=II  2.1146  0.3610  5.86  <0.0001 
Killip=III  3.7596  0.7310  5.14  <0.0001 
Killip=IV  4.0790  0.8259  4.94  <0.0001 
sysbp  0.0386  0.0017  23.10  <0.0001 
pulse  0.0221  0.0141  1.57  0.1168 
pulse’  0.0416  0.0143  2.90  0.0037 
pmi=yes  0.4664  0.0485  9.62  <0.0001 
miloc=Other  0.3048  0.1163  2.62  0.0087 
miloc=Anterior  0.5370  0.0443  12.12  <0.0001 
age × Killip=II  0.0216  0.0051  4.22  <0.0001 
age × Killip=III  0.0363  0.0103  3.51  0.0004 
age × Killip=IV  0.0323  0.0124  2.61  0.0090 
Code
anova(f)
Wald Statistics for day30


χ^{2}  d.f.  P  

tx  15.46  2  0.0004 
age (Factor+Higher Order Factors)  1390.07  4  <0.0001 
All Interactions  31.13  3  <0.0001 
Killip (Factor+Higher Order Factors)  427.94  6  <0.0001 
All Interactions  31.13  3  <0.0001 
sysbp  533.64  1  <0.0001 
pulse  325.19  2  <0.0001 
Nonlinear  8.43  1  0.0037 
pmi  92.55  1  <0.0001 
miloc  146.92  2  <0.0001 
age × Killip (Factor+Higher Order Factors)  31.13  3  <0.0001 
TOTAL NONLINEAR + INTERACTION  39.17  4  <0.0001 
TOTAL  3167.41  15  <0.0001 
Code
cat('</small>')
Code
< coef(f) # vector of regression coefficients
cof # For cof, X*beta without treatment coefficients estimates logit
# for SK+tPA combination therapy (reference cell). The coefficient for
# SK estimates the difference in logits from combo to SK. The coefficient
# for tPA estimates the difference in tPA from combo. The mortality
# difference of interest is mortality with SK minus mortality with tPA.
< function(x) plogis(x + cof['tx=SK'])
mort.sk < function(x)
mort.diff ifelse(x < 0, mort.sk(x)  plogis(x + cof['tx=tPA']), NA)
# only define when logit < 0 since Ushaped
< nomogram(f, fun=list(mort.sk, mort.diff),
n funlabel=c("30Day Mortality\nFor SK Treatment",
"Mortality Reduction by tPA"),
fun.at=list(c(.001,.005,.01,.05,.1,.2,.5,.7,.9),
c(.001,.005,.01,.02,.03,.04,.05)),
pulse=seq(0,260,by=10), omit='tx', lp=FALSE)
Code
plot(n, varname.label.sep=' ', xfrac=.27, lmgp=.2, cex.axis=.6)
Absolute Benefit on Survival Prob.
B
 Cox PH model
 Modeling can uncover time course of treatment effect
 \(X_1\) = treatment, \(A={X_{2},\ldots,X_{p}}\) adjustment variables
 Survival difference is
\(S(t  X_{1}=1, A)  S(t  X_{1}=0, A) = S(t  X_{1}=0, A)^{HR}  S(t  X_{1}=0, A)\)
Code
plot(0, 0, type="n", xlab="Survival for Control Subject",
ylab="Improvement in Survival",
xlim=c(0,1), ylim=c(0,.7))
< 0
i < seq(.1, .9, by=.1)
hr for(h in hr) {
< i + 1
i < seq(.0001, .9999, length=200)
p < p ^ h
p2 < p2  p
d lines(p, d, lty=i)
< max(d)
maxd < p[d==maxd]
smax text(smax,maxd+.02, format(h), cex=.6)
}
 See also Kent & Hayward (2007)
13.7 Cost–Effectiveness Ratios
 Effectiveness E (denominator of C–E ratio) is always absolute
 Absolute treatment effectiveness varies greatly with patient characteristics
 ⇒ C–E ratio varies greatly
 A C–E ratio based on average E and average C may not apply to any existing patient!
 Need a model to estimate E
 C may also depend on patient characteristics
Code
< 2400 / delta / 1e6
cost.life plot(density(cost.life), xlab='Cost Per Life Saved, $M', main='',
ylab='Probability Density', xlim=c(0, 6))
< 2400 / mean(delta) / 1e6
m < par("usr")
u arrows(m, u[3], m, 0, length=.1, lwd=2)
text(m,.01,'Cost using\n average\n reduction',srt=45,adj=0)
13.8 Treatment Contrasts for Multi–Site Randomized Trials
Primary model: covariables, treatment, site main effects
Planned secondary model to assess consistency of treatment effects over sites (add site \(\times\) treatment interactions)
Advantages for considering sites as random effects (or use penalized MLE to shrink site effects, especially for small sites). See Andersen et al. (1999) for a random effects Cox model and a demonstration that treatment effects may be inconsistent when non–zero site main effects are ignored in the Cox model. See also Yamaguchi & Ohashi (1999).
Types of tests / contrasts when interactions are included (Schwemer (2000)):
 Type I: not adjusted for center
 Type II: average treatment effect, weighted by size of sites
Rrms
package code:
sites < levels(site) contrast(fit, list(treat='b', site=sites), list(treat='a', site=sites), type='average', weights=table(site))
 Type III: average treatment effect, unweighted
contrast(fit, list(treat='b', site=sites), list(treat='a', site=sites), type='average')
Low precision; studies are not powered for Type III tests.
Another interesting test: combined effects of treatment and site \(\times\) treatment interaction; tests whether treatment was effective at any site.
13.9 Statistical Plan for Randomized Trials
 FDA draft guidance  advocates for many forms of covariate adjustment
 When a relevant dataset is available before the trial begins, develop the model from the dataset and use the predicted value as a single adjustment covariable in the trial (Knaus et al. (1993))
 Otherwise: CPMP Working Party: Finalize choice of model, transformations, interactions before merging treatment assignment into analysis dataset.
Edwards (1999): Pre–specify family of models that will be used, along with the strategy for selecting the particular model.
Masked model derivation does not bias treatment effect.  CPMP guidance (Committee for Proprietary Medicinal Products, 2004)
 ``Stratification may be used to ensure balance of treatments across covariates; it may also be used for administrative reasons. The factors that are the basis of stratification should normally be included as covariates in the primary model.
 Variables known a priori to be strongly, or at least moderately, associated with the primary outcome and/or variables for which there is a strong clinical rationale for such an association should also be considered as covariates in the primary analysis. The variables selected on this basis should be prespecified in the protocol or the statistical analysis plan.
 Baseline imbalance observed post hoc should not be considered an appropriate reason for including a variable as a covariate in the primary analysis.
 Variables measured after randomization and so potentially affected by the treatment should not normally be included as covariates in the primary analysis.
 If a baseline value of a continuous outcome measure is available, then this should usually be included as a covariate. This applies whether the primary outcome variable is defined as the ‘raw outcome’ or as the ‘change from baseline’.
 Only a few covariates should be included in a primary analysis. Although larger data sets may support more covariates than smaller ones, justification for including each of the covariates should be provided. (???)
 In the absence of prior knowledge, a simple functional form (usually either linearity or dichotomising a continuous scale) should be assumed for the relationship between a continuous covariate and the outcome variable. (???)
 The validity of the model assumptions must be checked when assessing the results. This is particularly important for generalized linear or nonlinear models where misspecification could lead to incorrect estimates of the treatment effect. Even under ordinary linear models, some attention should be paid to the possible influence of extreme outlying values.
 Whenever adjusted analyses are presented, results of the treatment effect in subgroups formed by the covariates (appropriately categorised, if relevant) should be presented to enable an assessment of the validity of the model assumptions. (???)
 Sensitivity analyses should be preplanned and presented to investigate the robustness of the primary results. Discrepancies should be discussed and explained. In the presence of important differences that cannot be logically explainedfor example, between the results of adjusted and unadjusted analysesthe interpretation of the trial could be seriously affected.
 The primary model should not include treatment by covariate interactions. If substantial interactions are expected a priori, the trial should be designed to allow separate estimates of the treatment effects in specific subgroups.
 Exploratory analyses may be carried out to improve the understanding of covariates not included in the primary analysis, and to help the sponsor with the ongoing development of the drug.
 A primary analysis, unambiguously prespecified in the protocol or statistical analysis plan, correctly carried out and interpreted, should support the conclusions which are drawn from the trial. Since there may be a number of alternative valid analyses, results based on prespecified analyses will carry most credibility.’’
In confirmatory trials, a model is prespecified, and it is necessary to pretend that it is true. In most other statistical applications, the choice of model is data–driven, but it is necessary to pretend that it is not.
— Edwards (1999); see also Sigueira & Taylor (1999)
 Choose predictors based on expert opinion
 Impute missing values rather than discarding observations
 Keep all pre–specified predictors in model, regardless of \(P\)–value
 Use shrinkage (penalized maximum likelihood estimation) to avoid over–adjustment
 Some guidance for handling missing baseline data in RCTs is in White & Thompson (2005)
13.9.1 Sites vs. Covariables
 Site effects (main or interaction) are almost always trivial in comparison with patientspecific covariable effects
 It is not sensible to include site in the model when important covariables are omitted
 The most logical and usually the most strong interaction with treatment is not site but is the severity of disease being treated
13.9.2 Covariable Adjustment vs. Allocation Based on Covariates
The decision to fit prognostic factors has a far more dramatic effect on the precision of our inferences than the choice of an allocation based on covariates or randomization approach and one of my chief objections to the allocation based on covariates approach is that trialists have tended to use the fact that they have balanced as an excuse for not fitting. This is a grave mistake.
— Senn (2004) p. 3748; see also Senn et al. (2010)
My view .. was that the form of analysis envisaged (that is to say, which factors and covariates should be fitted) justified the allocation and not vice versa.
— Senn (2004), p. 3747
13.10 Robustness
Regression models relate some transformation, called the link function, of \(Y\) to the linear predictor \(X\beta\). For multiple linear regression the link function is the identity function. For binary \(Y\) the most popular link function is the log odds (logit) but other links are possible. White et al. (2021) have shown that when a standard link function (e.g., logit for binary \(Y\) and the identity function for continuous \(Y\)) is used, analysis of covariance is robust to model misspecification. See here for simulations supporting this for binary logistic regression. For a proportional odds ordinal logistic model, a simulated example shows robustness even in the face of adjusting for a very important covariate that causes an extreme violation of the proportional odds assumption. The effect of such a violation is to effectively ignore the covariate, making the adjusted analysis virtually the same as the unadjusted analysis if no other covariates are present.
In general, violation of model assumptions about treatment effect parameters do not affect type I assertion probability \(\alpha\). That is because under the null hypothesis that the treatment is ignorable, there are no assumptions to be made about the form of treatment effects. For example in a proportional odds ordinal logistic model, under \(H_0\) the treatment effect must operate in a proportional odds fashion (i.e., all odds ratios involving treatment are 1.0 so they are all equal). It is impossible to violate the proportional odds assumption under the null.
An example of poor model fitting where type I assertion probability \(\alpha\) for testing treatment the difference was not preserved is in the analysis of longitudinal ordinal response under a firstorder Markov process when there is a strong time effect that is mismodeled. If the mixture of outcome states changes arbitrarily over time so that time operates in a nonproportional odds manner but proportional odds is falsely assumed, the ordinary standard error estimates for the treatment effect are too small and \(\alpha\) is significantly larger than the intended 0.05. Using the cluster sandwich covariance matrix estimator in place of the usual information matrixbased standard errors completely fixed that problem.
13.11 Summary
The point of view is sometimes defended that analyses that ignore covariates are superior because they are simpler. I do not accept this. A value of \(\pi=3\) is a simple one and accurate to one significant figure … However very few would seriously maintain that if should generally be adopted by engineers.
— Senn (2004), p. 3741
 As opposed to simple treatment group comparisons, modeling can
 Improve precision (linear, log–linear models)
 Get the “right” treatment effect (nonlinear models)
 Improve power (almost all models)
 Uncover outcome patterns, shapes of effects
 Test/estimate differential treatment benefit
 Determine whether some patients are too sick or too well to benefit
 Estimate absolute clinical benefit as a function of severity of illness
 Estimate meaningful cost–effectiveness ratios
 Help design the next clinical trial (optimize risk distribution for maximum power)
 Modeling strategy must be well thought–out
 Not “data mining”
 Not done to optimize the treatment \(P\)–value
13.12 Notes
I think it is most important to decide what it is you want to estimate, and then formulate a model that will accomplish that. Unlike ordinary linear models, which provide unbiased treatment effects if balanced covariates are mistakenly omitted from the model in an RCT, most models (such as the Cox PH model) result in biased treatment effects even when there is perfect balance in covariates, if the covariates have nonzero effects on the outcome. This is another way of talking about residual outcome heterogeneity.
If you want to estimate the effect of variable \(X\) on survival time, averaging over males and females in some strange undocumented way, you can get the population averaged effect of \(X\) without including sex in the model. Recognize however this is like comparing some of the males with some of the females when estimating the \(X\) effect. This is seldom of interest. More likely we want to know the effect of \(X\) for males, the effect for females, and if there is no interaction we pool the two to more precisely estimate the effect of \(X\) conditional on sex. Another way to view this is that the PH assumption is more likely to hold when you condition on covariates than when you don’t. No matter what happens though, if PH holds for one case, it cannot hold for the other, e.g., if PH holds after conditioning, it cannot hold when just looking at the marginal effect of X.
— From a posting by Harrell to the Medstats google group on 19Jan09
See the excellent Tweetorial by Darren Dahly here and this supplement to it by Ben Andrew.
This article by Stephen Senn is also very helpful.