# 13  Analysis of Covariance in Randomized Studies

Introduction

• Fundamental clinical question: if I were to give this patient treatment B instead of treatment A, by how much should I expect the clinical outcome to be better?
• A gold-standard design is a 6-period 2-treatment randomized crossover study; the patient actually receives both treatments and her responses can be compared
• Using each patient as her own control
• Studies of identical twins come somewhat close to that
• Short of these designs, the best we can do is to ask this fundamental question:
If two patients differ at baseline (with respect to available measurements, and on the average with respect to unmeasured covariates) only with regard to assigned treatment, how do their expected outcomes differ?
• Analysis of covariance is classical terminology from linear models but we often use the term also for nonlinear models
• Covariate adjustment means to condition on important facts known about the patients before treatment is given
• We treat baseline values as known constants and do not model their distributions
• Doing a comparison of outcomes that is equalized on baseline characteristics (within treatment group) accounts for outcome heterogeneity within treatment group
• Accounting for easily accountable (baseline measurements) outcome heterogeneity either
• maximizes power and precision (in linear models, by reducing residual variance), or
• gets the model “right” which prevents power loss caused by unexplained outcome heterogeneity (in nonlinear models, by making the model much more correct than assuming identical outcome tendencies for all subjects in a treatment group)

The model to use depends on the nature of the outcome variable Y, and different models have different effect measures:

• Continuous Y: we often use difference in means; 2-sample $$t$$-test assumes all observations in a treatment group have the same mean
• analyzing heterogeneous $$Y$$ as if homogeneous (using $$t$$-test instead of ANCOVA) lowers power and precision by expanding $$\sigma^2$$, but does not bias the difference in means
• Binary Y: classic $$2\times 2$$ table assumes every person on treatment T has the same probability of outcome
• risk difference is only a useful measure in this homogeneous case
• in heterogeneous case, risk difference will vary greatly, e.g., sicker patients have more “room to move”
• without covariate adjustment, odds ratios only apply to the homogenous case; in heterogeneous case the unadjusted odds ratio may apply to no one (see fharrell.com/post/marg)
• Time to event: Cox proportional hazards model and its special case the logrank test
• without covariate adjustment assumes homogeneity within treatment group

• unadjusted hazard ratio does not estimate a useful quantity if there is heterogeneity

• non-proportional hazards is more likely without covariate adjustment than with it

• without adjustment the failure time distribution is really a mixture of multiple distributions

Hierarchy of Causal Inference for Treatment Efficacy

Let $$P_{i}$$ denote patient $$i$$ and the treatments be denoted by $$A$$ and $$B$$. Thus $$P_{2}^{B}$$ represents patient 2 on treatment $$B$$. $$\overline{P}_{1}$$ represents the average outcome over a sample of patients from which patient 1 was selected.

Design Patients Compared
6-period crossover $$P_{1}^{A}$$ vs $$P_{1}^{B}$$ (directly measure HTE)
2-period crossover $$P_{1}^{A}$$ vs $$P_{1}^{B}$$
RCT in idential twins $$P_{1}^{A}$$ vs $$P_{1}^{B}$$
$$\parallel$$ group RCT $$\overline{P}_{1}^{A}$$ vs $$\overline{P}_{2}^{B}$$, $$P_{1}=P_{2}$$ on avg
Observational, good artificial control $$\overline{P}_{1}^{A}$$ vs $$\overline{P}_{2}^{B}$$, $$P_{1}=P_{2}$$ hopefully on avg
Observational, poor artificial control $$\overline{P}_{1}^{A}$$ vs $$\overline{P}_{2}^{B}$$, $$P_{1}\neq P_{2}$$ on avg
Real-world physician practice $$P_{1}^{A}$$ vs $$P_{2}^{B}$$

## 13.1 Covariable Adjustment in Linear Models

If you fail to adjust for pre-specified covariates, the statistical model’s residuals get larger. Not good for power, but incorporates the uncertainties needed for any possible random baseline imbalances.
— F Harrell 2019

A
• Model: $$E(Y | X) = X \beta + \epsilon$$
• Continuous response variable $$Y$$, normal residuals
• Statistical testing for baseline differences is scientifically incorrect (Altman & Doré 1990, Begg 1990, Senn 1994, Austin et al 2010); as Bland & Altman (2011) stated, statistical tests draw inferences about populations, and the population model here would involve a repeat of the randomization to the whole population hence balance would be perfect. Therefore the null hypothesis of no difference in baseline distributions between treatment groups is automatically true.
• If we are worried about baseline imbalance we need to search patient records for counter– balancing factors
• ⇒ imbalance is not the reason to adjust for covariables
• Adjust to gain efficiency by subtracting explained variation
• Relative efficiency of unadjusted treatment comparison is $$1-\rho^2$$
• Unadjusted analyses yields unbiased treatment effect estimate

See here for a detailed discussion, including reasons not to even stratify by treatment in “Table 1.”

## 13.2 Covariable Adjustment in Nonlinear Models

### 13.2.1 Hidden Assumptions in $$2 \times 2$$ Tables

B
• Traditional presentation of 2–treatment clinical trial with a binary response: $$2 \times 2$$ table
• Parameters: $$P_{1}, P_{2}$$ for treatments 1, 2
• Test of goodness of fit: $$H_0$$: all patients in one treatment group have same probability of positive response ($$P_j$$ constant)
• $$H_0$$: no risk factors exist
• Need to account for patient heterogeneity
• Matthews & Badi (2015) has a method for estimating the bias in unadjusted the log odds ratio and also has excellent background information

### 13.2.2 Models for Binary Response

C
• Model for probability of event must be nonlinear in predictors unless risk range is tiny
• Useful summary of relative treatment effect is the odds ratio (OR)
• Use of binary logistic model for covariable adjustment will result in an {} in the S.E. of the treatment effect (log odds ratio) (Robinson & Jewell (1991))
• But even with perfect balance, adjusted OR $$\neq$$ unadjusted OR

#### Example from GUSTO–I

D

• Steyerberg et al. (2000)
• Endpoint: 30–day mortality (0.07 overall)
• 10,348 patients given accelerated $$t$$–PA
• 20,162 patients given streptokinase (SK)
• Means and Percentages
Baseline Characteristic $$t$$–PA SK
Age 61.0 60.9
Female 25.3 25.3
Weight 79.6 79.4
Height 171.1 171.0
Hypertension 38.2 38.1
Diabetes 14.5 15.1
Never smoked 29.8 29.6
High cholesterol 34.6 34.3
Previous MI 16.9 16.5
Hypotension 8.0 8.3
Tachycardia 32.5 32.7
Anterior MI 38.9 38.9
Killip class I 85.0 85.4
ST elevation 37.3 37.8

E
• With and without adjusting for 17 baseline characteristics
Type of Analysis Log OR S.E. $$\chi^2$$
• Percent reduction in odds of death: 15% vs. 18
• -0.159 (15%) is a biased estimate
• Increase in S.E. more than offset by increase in treatment effect
• Adjusted comparison based on 19% fewer patients would have given same power as unadjusted test
Some disagree that the word bias is the appropropriate word. This could be called attenuation of the parameter estimate due to an ill-fitting model.
Code
load('gustomin.rda')
with(gustomin,
plot(density(p.sk), xlim=c(0, .4), xlab='Baseline Expected Risk',
ylab='Probability Density', main='') )
• See this for a more in-depth analysis
• Robinson & Jewell: “It is always more efficient to adjust for predictive covariates when logistic models are used, and thus in this regard the behavior of logistic regression is the same as that of classic linear regression.”
F

#### Simple Logistic Example – Gail 1986

Males A B Total
Y=0 500 900 1400
Y=1 500 100 600
Total 1000 1000 2000 OR=9

Females A B Total
Y=0 100 500 600
Y=1 900 500 1400
Total 1000 1000 2000 OR=9

Pooled A B Total
Y=0 600 1400 2000
Y=1 1400 600 2000
Total 2000 2000 4000 OR=5.44

From seeing this example one can argue that odds ratios, like hazard ratios, were never really designed to be computed on a set of subjects having heterogeneity in their expected outcomes. See fharrell.com/post/marg for more.

### 13.2.3 Nonlinear Models, General

G

• M. H. Gail et al. (1984) showed that for unadjusted treatment estimates to be unbiased, regression must be linear or exponential
• Mitchell H. Gail (1986) showed that for logistic, Cox, and paired survival models unadjusted treatment effects are asymptotically biased low in absolute value
• Gail also studied normal, exponential, additive risk, Poisson

Part of the problem with Poisson, proportional hazard and logistic regression approaches is that they use a single parameter, the linear predictor, with no equivalent of the variance parameter in the Normal case. This means that lack of fit impacts on the estimate of the predictor.
Senn (2004), p. 3747

## 13.3 Cox / Log–Rank Test for Time to Event

• Lagakos & Schoenfeld 1984 showed that type I probability is preserved if don’t adjust
• If hazards are proportional conditional on covariables, they are not proportional if omit covariables
• Morgan 1986 derived asymptotic relative efficiencies (ARE) of unadjusted log–rank test if a binary covariable is omitted
• If prevalence of covariable $$X$$ is 0.5:
H
$$X=1 : X=0$$ Hazard Ratio ARE
1.0 1.00
1.5 0.95
2.0 0.88
3.0 0.72

I
• Ford et al. (1995): Treatment effect does not have the same interpretation under unadjusted and adjusted models
• No reason for the two hazard ratios to have the same value
• Akazawa et al. (1997): Power of unadjusted and stratified log–rank test
Number Range of Power Power
1 0 .78
2 0–0.5 .77 .78
0–1 .67 .78
0–2 .36 .77
4 0–3 .35 .77
8 0–3.5 .33 .77

### 13.3.1 Sample Size Calculation Issues

• Schoenfeld (1983) implies that covariable adjustment can only $$\uparrow$$ sample size in randomized trials
• Need to recognize ill–definition of unadjusted hazard ratios
J

## 13.4 Why are Adjusted Estimates Right?

• Hauck et al. (1998), who have an excellent review of covariable adjustment in nonlinear models, state:

For use in a clinician–patient context, there is only a single person, that patient, of interest. The subject-specific measure then best reflects the risks or benefits for that patient. Gail has noted this previously [ENAR Presidential Invited Address, April 1990], arguing that one goal of a clinical trial ought to be to predict the direction and size of a treatment benefit for a patient with specific covariate values. In contrast, population–averaged estimates of treatment effect compare outcomes in groups of patients. The groups being compared are determined by whatever covariates are included in the model. The treatment effect is then a comparison of average outcomes, where the averaging is over all omitted covariates.
Hauck et al. (1998)

K

## 13.5 How Many Covariables to Use?

• Try to adjust for the bulk of the variation in outcome (Hauck et al. (1998),Tukey (1993))
• Neuhaus (1998): “to improve the efficiency of estimated covariate effects of interest, analysts of randomized clinical trial data should adjust for covariates that are strongly associated with the outcome”
• Raab et al. (2004) have more guidance for choosing covariables and provide a formula for linear model that shows how the value of adding a covariable depends on the sample size
L

## 13.6 Differential and Absolute Treatment Effects

### 13.6.1 Modeling Differential Treatment Effect

Differential treatment effect is often called heterogeneity of treatment effect or HTE. As opposed to the natural expansion of absolute treatment effect with underlying subject risk, differential treatment effect is usually based on analyses of relative effects, especially when the outcome is binary.

The most common approach to analyzing differential treatment effect involves searching for such effects rather than estimating the differential effect. This is, tragically, most often done through subgroup analysis.

#### Problems With Subgroup Analysis

M
• Subgroup analysis is widely practiced and widely derided in RCTs (as well as in observational studies)
• Construction of a subgroup from underlying factors that are continuous in nature (e.g., “older” = age $$\geq 65$$) assumes that the treatment effect is like falling off a cliff, i.e., all-or-nothing. Discontinuous treatment effects, like discontinuous main effects, have not been found and validated but have always been shown to have been an artificial construct driven by opinion and not data.
• Given a subgroup a simple label such as “class IV heart failure” may seem to be meaningful but subgrouping carries along other subject characteristics that are correlated with the subgrouping variable. So the subgroup’s treatment effect estimate has a more complex interpretation than what is thought.
• Researchers who don’t understand that “absence of evidence is not evidence for absence” interpret a significant effect in one subgroup but not in another as the treatment being effective in the first but ineffective in the second. The second $$P$$-value may be large because of increased variability in the second subgroup or because of a smaller effective sample size in that group.
• Testing the treatment effect in a subgroup does not carry along with it the covariate adjustment needed and assumes there is no residual HTE within the subgroup.
• When treatment interacts with factors not used in forming the subgroup, or when the subgrouping variable interacts with an ignored patient characteristic, the subgroup treatment effect may be grossly misleading. As an example, in GUSTO-I there was a strong interaction between Killip class and age. Unadjusted analysis of treatment effect within older subjects was partially an analysis of Killip class. And the treatment effect was not “significant” in older patients, leading many readers to conclude $$t$$–PA should not be given to them. In fact there was no evidence that age interacted with treatment, and the absolute risk reduction due to $$t$$–PA increased with age.

#### Specifying Interactions

N

• Assessing differential treatment effect best done with formal interaction tests rather than subgroup analysis
• Pre–specify sensible effect modifiers
• interactions between treatment and extent of disease
• “learned” interventions: interaction between treatment and duration of use by physician
• Interactions with center are not necessarily sensible
• Need to use penalized estimation (e.g., interaction effects as random effects) to get sufficient precision of differential treatment effects, if # interaction d.f. $$> 4$$ for example Sargent & Hodges (1996) and Yamaguchi & Ohashi (1999)

O

#### Strategy for Analyzing Differential Treatment Effect

The broad strategy that is recommended is based on the following:

P
• Anything resembling subgroup analysis should be avoided
• Anything that assumes that the treatment effect has a discontinuity in a continuous variable should be avoided. Differential effects should be smooth dose-response effects.
• Anything that results in a finding that has a simpler alternate explanation should be avoided
• Specify models so that an apparent interaction is not just a stand-in for an omitted main effect

Particulars of the strategy are:

Q
• Formulate a subject-matter driven covariate model, including all covariates understood to have strong effects on the outcome. Ensure that these covariates are adjusted for in every HTE analysis context.
• Main effects need to be flexibly modeled (e.g., using regression splines) so as to not assume linearity. False linearity assumptions can inflate apparent interaction effects because the interaction may be co-linear with omitted nonlinear main effects.
• If the sample size allows, also model interaction effects as smooth nonlinear functions. As with main effects, it is not uncommon for interaction effects to be nonlinear, e.g., the effect of treatment is small for age $$<70$$ then starts to expand rapidly after age 701. As a compromise, force interactions to be linear even if main effects are not, if the effective sample size does not permit estimating parameters for nonlinear differential effects.
• Consider effect modifiers that are somewhat uncorrelated with each other. Add the main effects corresponding to each potential effect modifier into the model described in the previous point. For example if the primary analysis uses a model containing age, sex, and severity of disease and one is interested in assessing HTE by race and by geographical region, add both race and region to the main effects that are adjusted for in all later models. This will result in proper covariate adjustment and will handle the case where an apparent interaction is partially explained by an omitted main effect that is co-linear with one of the interacting factors.
• Carry out a joint (chunk) likelihood ratio or $$F$$-test for all potential interaction effects combined. This test has a perfect multiplicity adjustment and will not be “brought down” by the potential interactions being co-linear with each other. The $$P$$-value from this joint test with multiple degrees of freedom will place the tests described in the next step in context. But note that with many d.f. the test may lack power.
• Consider each potential interaction one-at-a-time by adding that interaction term to the comprehensive main effects model. In the example given above, main effects simultaneously adjusted for would include treatment, age, sex, severity of disease, race, and region. Then treatment $$\times$$ race is added to the model and tested. Then the race interaction term (but not its main effect) is removed from the model and is replaced by region interactions.
• 1 This does not imply that age should be modeled as a categorical variable to estimate the interaction; that would result in unexplained HTE in the age $$\geq 70$$ interval.

• R

Concerning the last step, we must really recognize that there are two purposes for the analysis of differential treatment effect:

S
1. Understanding which subject characteristics are associated with alterations in efficacy
2. Predicting the efficacy for an individual subject For the latter purpose, one might include all potential treatment interactions in the statistical model, whether interactions for multiple baseline variables are co-linear or not. Such co-linearities do not hurt predictions or confidence intervals for predictions. For the former purpose, interactions can make one effect modifier compete with another, and neither modifier will be significant when adjusted for the other. As an example, suppose that the model had main effects for treatment, age, sex, and systolic and diastolic blood pressure (SBP, DBP) and we entertained effect modification due to one or both of SBP, DBP. Because of the correlation between SBP and DBP, the main effects for these variables are co-linear and their statistical significance is weakened because they estimate, for example, the effect of increasing SBP holding DBP constant (which is difficult to do). Likewise, treatment $$\times$$ SBP and treatment $$\times$$ DBP interactions are co-linear with each other, making statistical tests for interaction have low power. We may find that neither SBP nor DBP interaction is “significant” after adjusting for the other, but that the combined chunk test for SBP or DBP interaction is highly significant. Someone who does not perform the chunk test may falsely conclude that SBP does not modify the treatment effect. This problem is more acute when more than two factors are allowed to interact with treatment.

Note that including interaction terms in the model makes all treatment effect estimates conditional on specific covariate values, effectively lowering the sample size for each treatment effect estimate. When there is no HTE, the overall treatment main effect without having interaction terms in the model is by far the highest precision estimate. There is a bias-variance tradeoff when considering HTE. Adding interaction terms lowers bias but greatly increases variance.

Another way to state the HTE estimation strategy is as follows.

T
1. Pick a model for which it is mathematically possible that there be no interactions (no restrictions on model parameters)
2. Develop a model with ample flexibly-modeled main effects and clinically pre-specified interactions. Use a Bayesian skeptical prior or penalized maximum likelihood estimate to shrink the interaction terms if the effective sample size does not support the pre-specified number of interaction parameters. Be sure to model interaction effects as continuous if they come from variables of an underlying continuous nature.
3. Display and explain relative effects (e.g., odds ratios) as a function of interacting factors and link that to what is known about the mechanism of treatment effect.
4. Put all this in a context (like Figure Figure 13.7) that shows absolute treatment benefit (e.g., risk scale) as a function of background risk and of interacting factors.
5. State clearly that background risk comes from all non-treatment factors and is a risk-difference accelerator that would increase risk differences for any risk factor, not just treatment. Possibly provide a risk calculator to estimate background risk to plug into the $$x$$-axis.

Consider simulated two-treatment clinical trial data to illustrate the “always adjust for all main effects but consider interactions one at a time” approach to analyzing and displaying evidence for differential treatment effec. After that we will illustrate a simultaneous interaction analysis. Simulate time-to-event data from an exponential distribution, and fit Cox proportional hazards models to estimate the interaction between age and treatment and between sex and treatment. The true age effect is simulated as linear in the log hazard and the treatment effect on the log relative hazard scale is proportional to how far above 60 years is a patient’s age, with no treatment benefit before age 60. This is a non-simple interaction that could be exactly modeled with a linear spline function, but assume that the analyst does not know the true form of the interaction so she allows for a more general smooth form using a restricted cubic spline function. The data are simulated so that there is no sex interaction.

Code
require(rms)
options(prType='html')   # for cph print, anova
set.seed(1)
n <- 3000    # total of 3000 subjects
age <- rnorm(n, 60, 12)
label(age) <- 'Age'
sex   <- factor(sample(c('Male', 'Female'), n, rep=TRUE))
treat <- factor(sample(c('A', 'B'), n, rep=TRUE))
cens  <- 15 * runif(n)     # censoring time
h <- 0.02 * exp(0.04 * (age - 60) + 0.4 * (sex == 'Female') -
0.04 * (treat == 'B') * pmax(age - 60, 0))
dt <- -log(runif(n)) / h
label(dt) <- 'Time Until Death or Censoring'
e <- ifelse(dt <= cens, 1, 0)
dt <- pmin(dt, cens)
units(dt) <- 'Year'
S <- Surv(dt, e)
f <- cph(S ~ sex + rcs(age, 4) * treat)
f

Cox Proportional Hazards Model

 cph(formula = S ~ sex + rcs(age, 4) * treat)

Model Tests Discrimination
Indexes
Obs 3000 LR χ2 130.78 R2 0.047
Events 470 d.f. 8 R28,3000 0.040
Center 2.3448 Pr(>χ2) 0.0000 R28,470 0.230
Score χ2 156.15 Dxy 0.261
Pr(>χ2) 0.0000
β S.E. Wald Z Pr(>|Z|)
sex=Male  -0.3628  0.0936 -3.88 0.0001
age   0.0423  0.0283 1.50 0.1349
age’   0.0132  0.0655 0.20 0.8404
age’’  -0.0456  0.2499 -0.18 0.8553
treat=B   1.6377  1.6902 0.97 0.3326
age × treat=B  -0.0335  0.0357 -0.94 0.3479
age’ × treat=B   0.0643  0.0892 0.72 0.4709
age’’ × treat=B  -0.4048  0.3606 -1.12 0.2616
Code
anova(f)
 χ2 d.f. P Wald Statistics for S sex 15.03 1 0.0001 age (Factor+Higher Order Factors) 102.95 6 <0.0001 All Interactions 19.92 3 0.0002 Nonlinear (Factor+Higher Order Factors) 6.04 4 0.1961 treat (Factor+Higher Order Factors) 27.80 4 <0.0001 All Interactions 19.92 3 0.0002 age × treat (Factor+Higher Order Factors) 19.92 3 0.0002 Nonlinear 4.11 2 0.1284 Nonlinear Interaction : f(A,B) vs. AB 4.11 2 0.1284 TOTAL NONLINEAR 6.04 4 0.1961 TOTAL NONLINEAR + INTERACTION 20.80 5 0.0009 TOTAL 138.91 8 <0.0001

The model fitted above allows for a general age $$\times$$ treatment interaction. Let’s explore this interaction by plotting the age effect separately by treatment group, then plotting the treatment B:A hazard ratio as a function of age.

Re-fit the model allowing for a sex interaction (but not an age interaction) and display the results.

Code
g <- cph(S ~ sex * treat + rcs(age, 4))
g

Cox Proportional Hazards Model

 cph(formula = S ~ sex * treat + rcs(age, 4))

Model Tests Discrimination
Indexes
Obs 3000 LR χ2 108.40 R2 0.039
Events 470 d.f. 6 R26,3000 0.034
Center 1.2565 Pr(>χ2) 0.0000 R26,470 0.196
Score χ2 110.68 Dxy 0.249
Pr(>χ2) 0.0000
β S.E. Wald Z Pr(>|Z|)
sex=Male  -0.3494  0.1234 -2.83 0.0046
treat=B  -0.2748  0.1222 -2.25 0.0246
age   0.0228  0.0173 1.31 0.1890
age’   0.0452  0.0435 1.04 0.2981
age’’  -0.2088  0.1739 -1.20 0.2301
sex=Male × treat=B  -0.0526  0.1894 -0.28 0.7813
Code
anova(g)
 χ2 d.f. P Wald Statistics for S sex (Factor+Higher Order Factors) 15.87 2 0.0004 All Interactions 0.08 1 0.7813 treat (Factor+Higher Order Factors) 10.18 2 0.0062 All Interactions 0.08 1 0.7813 age 78.81 3 <0.0001 Nonlinear 1.85 2 0.3966 sex × treat (Factor+Higher Order Factors) 0.08 1 0.7813 TOTAL NONLINEAR + INTERACTION 1.92 3 0.5888 TOTAL 104.05 6 <0.0001
Code
k <- contrast(g, list(treat='B', sex=levels(sex)), list(treat='A', sex=levels(sex)))
k <- as.data.frame(k[Cs(sex,age,Contrast,Lower,Upper)])
kabl(k)
sex age Contrast Lower Upper
Female 59.7296 -0.2748 -0.5144 -0.0352
Male 59.7296 -0.3274 -0.6108 -0.0439

The above analysis signified a treatment effect for both sex groups (both confidence limits exclude a hazard ratio of 1.0), whereas this should only be the case when age exceeds 60. No sex $$\times$$ treatment interaction was indicated. Re-do the previous analysis adjusting for an age $$\times$$ treatment interaction while estimating the sex $$\times$$ treatment interaction. For this purpose we must specify an age value when estimating the treatment effects by sex. Here we set age to 50 years. Were age and sex correlated, the joint analysis would have been harder to interpret.

Code
h <- cph(S ~ treat * (rcs(age, 4) + sex))
anova(h)
 χ2 d.f. P Wald Statistics for S treat (Factor+Higher Order Factors) 27.85 5 <0.0001 All Interactions 19.98 4 0.0005 age (Factor+Higher Order Factors) 103.01 6 <0.0001 All Interactions 19.91 3 0.0002 Nonlinear (Factor+Higher Order Factors) 6.05 4 0.1956 sex (Factor+Higher Order Factors) 15.08 2 0.0005 All Interactions 0.06 1 0.8096 treat × age (Factor+Higher Order Factors) 19.91 3 0.0002 Nonlinear 4.10 2 0.1286 Nonlinear Interaction : f(A,B) vs. AB 4.10 2 0.1286 treat × sex (Factor+Higher Order Factors) 0.06 1 0.8096 TOTAL NONLINEAR 6.05 4 0.1956 TOTAL INTERACTION 19.98 4 0.0005 TOTAL NONLINEAR + INTERACTION 20.86 6 0.0019 TOTAL 138.58 9 <0.0001

This is a more accurate representation of the true underlying model, and is easy to interpret because (1) age and sex are uncorrelated in the simulation model used, and (2) only two interactions were considered.

#### Absolute vs. Relative Treatment Effects Revisited

Statistical models are typically chosen so as to maximize the likelihood of the model fitting the data and processes that generated them. Even though one may be interested in absolute effects, models must usually be based on relative effects so that quantities such as estimated risk are in the legal $$[0,1]$$ range. It is not unreasonable to think of a relative effects model such as the logistic model as providing supporting evidence that an interacting effect causes a change in the treatment effect, whereas the necessary expansion of treatment effect for higher risk subjects, as detailed in the next section, is merely a mathematical necessity for how probabilities work. One could perhaps say that any factor that confers increased risk for a subject (up to the point of diminishing returns; see below) might cause an increase in the treatment effect, but this is a general phenomenon that is spread throughout all risk factors independently of how they may directly affect the treatment effect. Because of this, additive risk models are not recommended for modeling outcomes or estimating differential treatment effects on an absolute scale. This logic leads to the following recommended strategy:

U
1. Base all inference and predicted risk estimation on a model most likely to fit the data (e.g., logistic risk model; Cox model). Choose the model so that it is possible that there may be no interactions, i.e., a model that does not place a restriction on regression coefficients.
2. Pre-specify sensible possible interactions as described earlier
3. Use estimates from this model to estimate relative differential treatment effects, which should be smooth functions of continuous baseline variables
4. Use the same model to estimate absolute risk differences (as in the next section) as a function both of interacting factors and of baseline variables in the model that do not interact with treatment (but will still expand the absolute treatment effect)
5. Even better: since expansion of risk differences is a general function of overall risk and not of individual risk factors only display effects of individual risk factors that interact with treatment (interacting on the scale that would have allowed them not to interact—the relative scale such as log odds) and show the general risk difference expansion as in Figures Figure 13.6 (think of the “risk factor” as treatment) and Figure 13.9

To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. ‘Additive at the point of analysis but relevant at the point of application’ should be the motto.

### 13.6.2 Estimating Absolute Treatment Effects

• Absolute efficacy measures:
• Risk difference ($$\delta$$)
• number needed to treat (reciprocal of risk difference)
• Years of life saved
• Binary response, no interactions with treatment, risk for control patient $$P$$:
$$\delta = P - \frac{P}{P + (1-P)/OR}$$
• $$\delta$$ is dominated by $$P$$
VW
Code
plot(0, 0, type="n", xlab="Risk for Subject Without Risk Factor",
ylab="Increase in Risk",
xlim=c(0,1), ylim=c(0,.6))
i <- 0
or <- c(1.1,1.25,1.5,1.75,2,3,4,5,10)
for(h in or) {
i <- i + 1
p <- seq(.0001, .9999, length=200)
logit <- log(p/(1 - p))  # same as qlogis(p)
logit <- logit + log(h)  # modify by odds ratio
p2 <- 1/(1 + exp(-logit))# same as plogis(logit)
d <- p2 - p
lines(p, d, lty=i)
maxd <- max(d)
smax <- p[d==maxd]
text(smax, maxd + .02, format(h), cex=.6)
}

If the outcome is such that $$Y=1$$ implies a good outcome, Figure Figure 13.6 would be useful for estimating the absolute risk increase for a “good” treatment by selecting the one curve according to the odds ratio the treatment achieved in a multivariable risk model. This assumes that the treatment does not interact with any patient characteristic(s).

van Klaveren et al. (2015) described the importance of correctly modeling interactions when estimating absolute treatment benefit.

Now consider the case where $$Y=1$$ is a bad outcome and $$Y=0$$ is a good outcome, and there is differential relative treatment effect according to a truly binary patient characteristic $$X=0,1$$. Suppose that treatment represents a new agent and the control group is patients on standard therapy. Suppose that the new treatment multiplies the odds of a bad outcome by 0.8 when $$X=0$$ and by 0.6 when $$X=1$$, and that the background risk that $$Y=1$$ for patients on standard therapy ranges from 0.01 to 0.99. The background risk could come from one or more continuous variables or mixtures of continuous and categorical patient characteristics. The “main effect” of $$X$$ must also be specified. We assume that $$X=0$$ goes into the background risk and $$X=1$$ increases the odds that $$Y=1$$ by a factor of 1.4 for patients on standard therapy. All of this specifies a full probability model that can be evaluated to show the absolute risk reduction by the new treatment as a function of background risk and $$X$$.

X
Code
require(Hmisc)
d <- expand.grid(X=0:1, brisk=seq(0.01, 0.99, length=150))
d <- upData(d,
risk.standard = plogis(qlogis(brisk) + log(1.4) * X),
risk.new      = plogis(qlogis(brisk) + log(1.4) * X +
log(0.8) * (X == 0) +
log(0.6) * (X == 1)),
risk.diff     = risk.standard - risk.new,
X = factor(X) )
Input object size:   19040 bytes;    2 variables     300 observations
Modified variable   X
New object size:    27152 bytes;    5 variables 300 observations
Code
ggplot(d, aes(x=risk.standard, y=risk.diff, color=X)) +
geom_line() +
xlab('Risk Under Standard Treatment') +
ylab('Absolute Risk Reduction With New Treatment')   # (* r ipacue() *)

It is important to note that the magnification of absolute risk reduction by increasing background risk should not be labeled (or analyzed) by any one contributing risk factor. This is a generalized effect that comes solely from the restriction of probabilities to the $$[0,1]$$ range.

#### Absolute Treatment Effects for GUSTO–I

Y

• No evidence for interactions with treatment
• Misleading subgroup analysis showed that elderly patients not benefit from $$t$$–PA; result of strong age $$\times$$ Killip class interaction
• Wide variation in absolute benefit of $$t$$–PA

Z
Code
delta <- with(gustomin, p.sk - p.tpa)
plot(density(delta), xlab='Mortality Difference',
ylab='Probability Density', main='')
m <- mean(delta)
u <- par("usr")
arrows(m, u[3], m, 0, length=.1, lwd=2)
med <- median(delta)
arrows(med, u[3], med, 0, length=.1, lwd=2)
text(med, 2, 'Median', srt=45, adj=0)
• Overall mortality difference of 0.011 dominated by high–risk patients
A
Code
load('gusto.rda')
require(rms)
f <- lrm(day30 ~ tx + age * Killip + pmin(sysbp, 120) +
lsp(pulse, 50) + pmi + miloc, data=gusto)
cat('<br><small>')

Code
f

Logistic Regression Model

 lrm(formula = day30 ~ tx + age * Killip + pmin(sysbp, 120) +
lsp(pulse, 50) + pmi + miloc, data = gusto)

Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 40830 LR χ2 4173.41 R2 0.245 C 0.821
0 37979 d.f. 15 R215,40830 0.097 Dxy 0.642
1 2851 Pr(>χ2) <0.0001 R215,7955.8 0.407 γ 0.642
max |∂log L/∂β| 5×10-6 Brier 0.055 τa 0.083
β S.E. Wald Z Pr(>|Z|)
Intercept  -3.9541  0.7135 -5.54 <0.0001
tx=SK   0.0738  0.0512 1.44 0.1499
tx=tPA  -0.1338  0.0608 -2.20 0.0276
age   0.0867  0.0026 33.67 <0.0001
Killip=II   2.1146  0.3610 5.86 <0.0001
Killip=III   3.7596  0.7310 5.14 <0.0001
Killip=IV   4.0790  0.8259 4.94 <0.0001
sysbp  -0.0386  0.0017 -23.10 <0.0001
pulse  -0.0221  0.0141 -1.57 0.1168
pulse’   0.0416  0.0143 2.90 0.0037
pmi=yes   0.4664  0.0485 9.62 <0.0001
miloc=Other   0.3048  0.1163 2.62 0.0087
miloc=Anterior   0.5370  0.0443 12.12 <0.0001
age × Killip=II  -0.0216  0.0051 -4.22 <0.0001
age × Killip=III  -0.0363  0.0103 -3.51 0.0004
age × Killip=IV  -0.0323  0.0124 -2.61 0.0090
Code
anova(f)
 χ2 d.f. P Wald Statistics for day30 tx 15.46 2 0.0004 age (Factor+Higher Order Factors) 1390.07 4 <0.0001 All Interactions 31.13 3 <0.0001 Killip (Factor+Higher Order Factors) 427.94 6 <0.0001 All Interactions 31.13 3 <0.0001 sysbp 533.64 1 <0.0001 pulse 325.19 2 <0.0001 Nonlinear 8.43 1 0.0037 pmi 92.55 1 <0.0001 miloc 146.92 2 <0.0001 age × Killip (Factor+Higher Order Factors) 31.13 3 <0.0001 TOTAL NONLINEAR + INTERACTION 39.17 4 <0.0001 TOTAL 3167.41 15 <0.0001
Code
cat('</small>')

Code
cof <- coef(f)             # vector of regression coefficients
# For cof, X*beta without treatment coefficients estimates logit
# for SK+t-PA combination therapy (reference cell).  The coefficient for
# SK estimates the difference in logits from combo to SK.  The coefficient
# for tPA estimates the difference in tPA from combo.  The mortality
# difference of interest is mortality with SK minus mortality with tPA.
mort.sk   <- function(x) plogis(x + cof['tx=SK'])
mort.diff <- function(x)
ifelse(x < 0, mort.sk(x) - plogis(x + cof['tx=tPA']), NA)
# only define when logit < 0 since U-shaped
n <- nomogram(f, fun=list(mort.sk, mort.diff),
funlabel=c("30-Day Mortality\nFor SK Treatment",
"Mortality Reduction by t-PA"),
fun.at=list(c(.001,.005,.01,.05,.1,.2,.5,.7,.9),
c(.001,.005,.01,.02,.03,.04,.05)),
pulse=seq(0,260,by=10), omit='tx', lp=FALSE)
Code
plot(n, varname.label.sep=' ', xfrac=.27, lmgp=.2, cex.axis=.6)

#### Absolute Benefit on Survival Prob.

B

• Cox PH model
• Modeling can uncover time course of treatment effect
• $$X_1$$ = treatment, $$A={X_{2},\ldots,X_{p}}$$ adjustment variables
• Survival difference is
$$S(t | X_{1}=1, A) - S(t | X_{1}=0, A) = S(t | X_{1}=0, A)^{HR} - S(t | X_{1}=0, A)$$

C
Code
plot(0, 0, type="n", xlab="Survival for Control Subject",
ylab="Improvement in Survival",
xlim=c(0,1), ylim=c(0,.7))
i <- 0
hr <- seq(.1, .9, by=.1)
for(h in hr) {
i <- i + 1
p <- seq(.0001, .9999, length=200)
p2 <- p ^ h
d <- p2 - p
lines(p, d, lty=i)
maxd <- max(d)
smax <- p[d==maxd]
text(smax,maxd+.02, format(h), cex=.6)
}

## 13.7 Cost–Effectiveness Ratios

• Effectiveness E (denominator of C–E ratio) is always absolute
• Absolute treatment effectiveness varies greatly with patient characteristics
• ⇒ C–E ratio varies greatly
• A C–E ratio based on average E and average C may not apply to any existing patient!
• Need a model to estimate E
• C may also depend on patient characteristics
D

E
Code
cost.life <- 2400 / delta / 1e6
plot(density(cost.life), xlab='Cost Per Life Saved, \$M', main='',
ylab='Probability Density', xlim=c(0, 6))
m <- 2400 / mean(delta) / 1e6
u <- par("usr")
arrows(m, u[3], m, 0, length=.1, lwd=2)
text(m,.01,'Cost using\n  average\n    reduction',srt=45,adj=0)

## 13.8 Treatment Contrasts for Multi–Site Randomized Trials

• Primary model: covariables, treatment, site main effects

• Planned secondary model to assess consistency of treatment effects over sites (add site $$\times$$ treatment interactions)

• Advantages for considering sites as random effects (or use penalized MLE to shrink site effects, especially for small sites). See Andersen et al. (1999) for a random effects Cox model and a demonstration that treatment effects may be inconsistent when non–zero site main effects are ignored in the Cox model. See also Yamaguchi & Ohashi (1999).

• Types of tests / contrasts when interactions are included (Schwemer (2000)):

• Type I: not adjusted for center
• Type II: average treatment effect, weighted by size of sites
R rms package code:
sites <- levels(site)
contrast(fit, list(treat='b', site=sites),
list(treat='a', site=sites),
type='average', weights=table(site))
• Type III: average treatment effect, unweighted
contrast(fit, list(treat='b', site=sites),
list(treat='a', site=sites), type='average')

Low precision; studies are not powered for Type III tests.

• Another interesting test: combined effects of treatment and site $$\times$$ treatment interaction; tests whether treatment was effective at any site.

FG

## 13.9 Statistical Plan for Randomized Trials

• FDA draft guidance - advocates for many forms of covariate adjustment
• When a relevant dataset is available before the trial begins, develop the model from the dataset and use the predicted value as a single adjustment covariable in the trial (Knaus et al. (1993))
• Otherwise: CPMP Working Party: Finalize choice of model, transformations, interactions before merging treatment assignment into analysis dataset.
Edwards (1999): Pre–specify family of models that will be used, along with the strategy for selecting the particular model.
Masked model derivation does not bias treatment effect.
• CPMP guidance
• Stratification may be used to ensure balance of treatments across covariates; it may also be used for administrative reasons. The factors that are the basis of stratification should normally be included as covariates in the primary model.
• Variables known a priori to be strongly, or at least moderately, associated with the primary outcome and/or variables for which there is a strong clinical rationale for such an association should also be considered as covariates in the primary analysis. The variables selected on this basis should be pre-specified in the protocol or the statistical analysis plan.
• Baseline imbalance observed post hoc should not be considered an appropriate reason for including a variable as a covariate in the primary analysis.
• Variables measured after randomization and so potentially affected by the treatment should not normally be included as covariates in the primary analysis.
• If a baseline value of a continuous outcome measure is available, then this should usually be included as a covariate. This applies whether the primary outcome variable is defined as the ‘raw outcome’ or as the ‘change from baseline’.
• Only a few covariates should be included in a primary analysis. Although larger data sets may support more covariates than smaller ones, justification for including each of the covariates should be provided. (???)
• In the absence of prior knowledge, a simple functional form (usually either linearity or dichotomising a continuous scale) should be assumed for the relationship between a continuous covariate and the outcome variable. (???)
• The validity of the model assumptions must be checked when assessing the results. This is particularly important for generalized linear or non-linear models where mis-specification could lead to incorrect estimates of the treatment effect. Even under ordinary linear models, some attention should be paid to the possible influence of extreme outlying values.
• Whenever adjusted analyses are presented, results of the treatment effect in subgroups formed by the covariates (appropriately categorised, if relevant) should be presented to enable an assessment of the validity of the model assumptions. (???)
• Sensitivity analyses should be pre-planned and presented to investigate the robustness of the primary results. Discrepancies should be discussed and explained. In the presence of important differences that cannot be logically explained-for example, between the results of adjusted and unadjusted analyses-the interpretation of the trial could be seriously affected.
• The primary model should not include treatment by covariate interactions. If substantial interactions are expected a priori, the trial should be designed to allow separate estimates of the treatment effects in specific subgroups.
• Exploratory analyses may be carried out to improve the understanding of covariates not included in the primary analysis, and to help the sponsor with the ongoing development of the drug.
• A primary analysis, unambiguously pre-specified in the protocol or statistical analysis plan, correctly carried out and interpreted, should support the conclusions which are drawn from the trial. Since there may be a number of alternative valid analyses, results based on pre-specified analyses will carry most credibility.’’
H

In confirmatory trials, a model is pre-specified, and it is necessary to pretend that it is true. In most other statistical applications, the choice of model is data–driven, but it is necessary to pretend that it is not.

• Choose predictors based on expert opinion
• Impute missing values rather than discarding observations
• Keep all pre–specified predictors in model, regardless of $$P$$–value
• Use shrinkage (penalized maximum likelihood estimation) to avoid over–adjustment
• Some guidance for handling missing baseline data in RCTs is in White & Thompson (2005)
I

### 13.9.1 Sites vs. Covariables

• Site effects (main or interaction) are almost always trivial in comparison with patient-specific covariable effects
• It is not sensible to include site in the model when important covariables are omitted
• The most logical and usually the most strong interaction with treatment is not site but is the severity of disease being treated
J

### 13.9.2 Covariable Adjustment vs. Allocation Based on Covariates

K

The decision to fit prognostic factors has a far more dramatic effect on the precision of our inferences than the choice of an allocation based on covariates or randomization approach and one of my chief objections to the allocation based on covariates approach is that trialists have tended to use the fact that they have balanced as an excuse for not fitting. This is a grave mistake.

My view .. was that the form of analysis envisaged (that is to say, which factors and covariates should be fitted) justified the allocation and not vice versa.
Senn (2004), p. 3747

## 13.10 Robustness

Regression models relate some transformation, called the link function, of $$Y$$ to the linear predictor $$X\beta$$. For multiple linear regression the link function is the identity function. For binary $$Y$$ the most popular link function is the log odds (logit) but other links are possible. White et al. (2021) have shown that when a standard link function (e.g., logit for binary $$Y$$ and the identity function for continuous $$Y$$) is used, analysis of covariance is robust to model misspecification. See here for simulations supporting this for binary logistic regression. For a proportional odds ordinal logistic model, a simulated example shows robustness even in the face of adjusting for a very important covariate that causes an extreme violation of the proportional odds assumption. The effect of such a violation is to effectively ignore the covariate, making the adjusted analysis virtually the same as the unadjusted analysis if no other covariates are present.

The primary reason for the result of White et al. (2021) is their demonstration that under an appropriate link function (e.g., one that does not artificially constrain the parameters as an additive risk model does) the adjusted treatment effect is zero if and only if the unadjusted treatment effect is zero.

In general, violation of model assumptions about treatment effect parameters do not affect type I assertion probability $$\alpha$$. That is because under the null hypothesis that the treatment is ignorable, there are no assumptions to be made about the form of treatment effects. For example in a proportional odds ordinal logistic model, under $$H_0$$ the treatment effect must operate in a proportional odds fashion (i.e., all odds ratios involving treatment are 1.0 so they are all equal). It is impossible to violate the proportional odds assumption under the null.

An example of poor model fitting where type I assertion probability $$\alpha$$ for testing treatment the difference was not preserved is in the analysis of longitudinal ordinal response under a first-order Markov process when there is a strong time effect that is mis-modeled. If the mixture of outcome states changes arbitrarily over time so that time operates in a non-proportional odds manner but proportional odds is falsely assumed, the ordinary standard error estimates for the treatment effect are too small and $$\alpha$$ is significantly larger than the intended 0.05. Using the cluster sandwich covariance matrix estimator in place of the usual information matrix-based standard errors completely fixed that problem.

But as shown in the linked report, a complete solution is had by using a partial proportional odds model to relax the proportional odds assumption with respect to absolute time.

## 13.11 Summary

The point of view is sometimes defended that analyses that ignore covariates are superior because they are simpler. I do not accept this. A value of $$\pi=3$$ is a simple one and accurate to one significant figure … However very few would seriously maintain that if should generally be adopted by engineers.
Senn (2004), p. 3741

L
• As opposed to simple treatment group comparisons, modeling can
• Improve precision (linear, log–linear models)
• Get the “right” treatment effect (nonlinear models)
• Improve power (almost all models)
• Uncover outcome patterns, shapes of effects
• Test/estimate differential treatment benefit
• Determine whether some patients are too sick or too well to benefit
• Estimate absolute clinical benefit as a function of severity of illness
• Estimate meaningful cost–effectiveness ratios
• Help design the next clinical trial (optimize risk distribution for maximum power)
• Modeling strategy must be well thought–out
• Not “data mining”
• Not done to optimize the treatment $$P$$–value
M

## 13.12 Notes

I think it is most important to decide what it is you want to estimate, and then formulate a model that will accomplish that. Unlike ordinary linear models, which provide unbiased treatment effects if balanced covariates are mistakenly omitted from the model in an RCT, most models (such as the Cox PH model) result in biased treatment effects even when there is perfect balance in covariates, if the covariates have nonzero effects on the outcome. This is another way of talking about residual outcome heterogeneity.

If you want to estimate the effect of variable $$X$$ on survival time, averaging over males and females in some strange undocumented way, you can get the population averaged effect of $$X$$ without including sex in the model. Recognize however this is like comparing some of the males with some of the females when estimating the $$X$$ effect. This is seldom of interest. More likely we want to know the effect of $$X$$ for males, the effect for females, and if there is no interaction we pool the two to more precisely estimate the effect of $$X$$ conditional on sex. Another way to view this is that the PH assumption is more likely to hold when you condition on covariates than when you don’t. No matter what happens though, if PH holds for one case, it cannot hold for the other, e.g., if PH holds after conditioning, it cannot hold when just looking at the marginal effect of X.
— From a posting by Harrell to the Medstats google group on 19Jan09

See the excellent Tweetorial by Darren Dahly here and this supplement to it by Ben Andrew.