13  Analysis of Covariance in Randomized Studies

Comments


Introduction

The model to use depends on the nature of the outcome variable Y, and different models have different effect measures:


Hierarchy of Causal Inference for Treatment Efficacy

Let \(P_{i}\) denote patient \(i\) and the treatments be denoted by \(A\) and \(B\). Thus \(P_{2}^{B}\) represents patient 2 on treatment \(B\). \(\overline{P}_{1}\) represents the average outcome over a sample of patients from which patient 1 was selected.

Design Patients Compared
6-period crossover \(P_{1}^{A}\) vs \(P_{1}^{B}\) (directly measure HTE)
2-period crossover \(P_{1}^{A}\) vs \(P_{1}^{B}\)
RCT in idential twins \(P_{1}^{A}\) vs \(P_{1}^{B}\)
\(\parallel\) group RCT \(\overline{P}_{1}^{A}\) vs \(\overline{P}_{2}^{B}\), \(P_{1}=P_{2}\) on avg
Observational, good artificial control \(\overline{P}_{1}^{A}\) vs \(\overline{P}_{2}^{B}\), \(P_{1}=P_{2}\) hopefully on avg
Observational, poor artificial control \(\overline{P}_{1}^{A}\) vs \(\overline{P}_{2}^{B}\), \(P_{1}\neq P_{2}\) on avg
Real-world physician practice \(P_{1}^{A}\) vs \(P_{2}^{B}\)

13.1 Covariable Adjustment in Linear Models

If you fail to adjust for pre-specified covariates, the statistical model’s residuals get larger. Not good for power, but incorporates the uncertainties needed for any possible random baseline imbalances.
— F Harrell 2019

A
  • Model: \(E(Y | X) = X \beta + \epsilon\)
  • Continuous response variable \(Y\), normal residuals
  • Statistical testing for baseline differences is scientifically incorrect (Altman & Doré 1990, Begg 1990, Senn 1994, Austin et al 2010); as Bland & Altman (2011) stated, statistical tests draw inferences about populations, and the population model here would involve a repeat of the randomization to the whole population hence balance would be perfect. Therefore the null hypothesis of no difference in baseline distributions between treatment groups is automatically true.
  • If we are worried about baseline imbalance we need to search patient records for counter– balancing factors
  • ⇒ imbalance is not the reason to adjust for covariables
  • Adjust to gain efficiency by subtracting explained variation
  • Relative efficiency of unadjusted treatment comparison is \(1-\rho^2\)
  • Unadjusted analyses yields unbiased treatment effect estimate

See here for a detailed discussion, including reasons not to even stratify by treatment in “Table 1.”

13.2 Covariable Adjustment in Nonlinear Models


13.2.1 Hidden Assumptions in \(2 \times 2\) Tables

B
  • Traditional presentation of 2–treatment clinical trial with a binary response: \(2 \times 2\) table
  • Parameters: \(P_{1}, P_{2}\) for treatments 1, 2
  • Test of goodness of fit: \(H_0\): all patients in one treatment group have same probability of positive response (\(P_j\) constant)
  • \(H_0\): no risk factors exist
  • Need to account for patient heterogeneity
  • Matthews & Badi (2015) has a method for estimating the bias in unadjusted the log odds ratio and also has excellent background information

13.2.2 Models for Binary Response

C
  • Model for probability of event must be nonlinear in predictors unless risk range is tiny
  • Useful summary of relative treatment effect is the odds ratio (OR)
  • Use of binary logistic model for covariable adjustment will result in an {} in the S.E. of the treatment effect (log odds ratio) (Robinson & Jewell (1991))
  • But even with perfect balance, adjusted OR \(\neq\) unadjusted OR
  • Adjusted OR will be greater than unadjusted OR

Example from GUSTO–I

D

  • Steyerberg et al. (2000)
  • Endpoint: 30–day mortality (0.07 overall)
  • 10,348 patients given accelerated \(t\)–PA
  • 20,162 patients given streptokinase (SK)
  • Means and Percentages
Baseline Characteristic \(t\)–PA SK
Age 61.0 60.9
Female 25.3 25.3
Weight 79.6 79.4
Height 171.1 171.0
Hypertension 38.2 38.1
Diabetes 14.5 15.1
Never smoked 29.8 29.6
High cholesterol 34.6 34.3
Previous MI 16.9 16.5
Hypotension 8.0 8.3
Tachycardia 32.5 32.7
Anterior MI 38.9 38.9
Killip class I 85.0 85.4
ST elevation 37.3 37.8

Unadjusted / Adj. Logistic Estimates

E
  • With and without adjusting for 17 baseline characteristics
Type of Analysis Log OR S.E. \(\chi^2\)
Unadjusted -0.159 0.049 10.8
Adjusted -0.198 0.053 14.0
  • Percent reduction in odds of death: 15% vs. 18
  • -0.159 (15%) is a biased estimate
  • Increase in S.E. more than offset by increase in treatment effect
  • Adjusted comparison based on 19% fewer patients would have given same power as unadjusted test
Some disagree that the word bias is the appropropriate word. This could be called attenuation of the parameter estimate due to an ill-fitting model.
Code
load('gustomin.rda')
with(gustomin,
     plot(density(p.sk), xlim=c(0, .4), xlab='Baseline Expected Risk',
          ylab='Probability Density', main='') )

Figure 13.1: Distribution of baseline risk in GUSTO-I. Kernel density estimate of risk distribution for SK treatment. Average risk is 0.07. See also Ioannidis & Lau (1997).

  • See this for a more in-depth analysis
  • Robinson & Jewell: “It is always more efficient to adjust for predictive covariates when logistic models are used, and thus in this regard the behavior of logistic regression is the same as that of classic linear regression.”
F

Simple Logistic Example – Gail 1986

Males A B Total
Y=0 500 900 1400
Y=1 500 100 600
Total 1000 1000 2000 OR=9


Females A B Total
Y=0 100 500 600
Y=1 900 500 1400
Total 1000 1000 2000 OR=9


Pooled A B Total
Y=0 600 1400 2000
Y=1 1400 600 2000
Total 2000 2000 4000 OR=5.44

From seeing this example one can argue that odds ratios, like hazard ratios, were never really designed to be computed on a set of subjects having heterogeneity in their expected outcomes. See fharrell.com/post/marg for more.

13.2.3 Nonlinear Models, General

G

  • M. H. Gail et al. (1984) showed that for unadjusted treatment estimates to be unbiased, regression must be linear or exponential
  • Mitchell H. Gail (1986) showed that for logistic, Cox, and paired survival models unadjusted treatment effects are asymptotically biased low in absolute value
  • Gail also studied normal, exponential, additive risk, Poisson

Part of the problem with Poisson, proportional hazard and logistic regression approaches is that they use a single parameter, the linear predictor, with no equivalent of the variance parameter in the Normal case. This means that lack of fit impacts on the estimate of the predictor.
Senn (2004), p. 3747

13.3 Cox / Log–Rank Test for Time to Event

  • Lagakos & Schoenfeld 1984 showed that type I probability is preserved if don’t adjust
  • If hazards are proportional conditional on covariables, they are not proportional if omit covariables
  • Morgan 1986 derived asymptotic relative efficiencies (ARE) of unadjusted log–rank test if a binary covariable is omitted
  • If prevalence of covariable \(X\) is 0.5:
H
\(X=1 : X=0\) Hazard Ratio ARE
1.0 1.00
1.5 0.95
2.0 0.88
3.0 0.72

I
  • Ford et al. (1995): Treatment effect does not have the same interpretation under unadjusted and adjusted models
  • No reason for the two hazard ratios to have the same value
  • Akazawa et al. (1997): Power of unadjusted and stratified log–rank test
Number Range of Power Power
of Strata Log Hazards Unadj. Adjusted
1 0 .78
2 0–0.5 .77 .78
0–1 .67 .78
0–2 .36 .77
4 0–3 .35 .77
8 0–3.5 .33 .77

13.3.1 Sample Size Calculation Issues

  • Schoenfeld (1983) implies that covariable adjustment can only \(\uparrow\) sample size in randomized trials
  • Need to recognize ill–definition of unadjusted hazard ratios
J

13.4 Why are Adjusted Estimates Right?

  • Hauck et al. (1998), who have an excellent review of covariable adjustment in nonlinear models, state:

For use in a clinician–patient context, there is only a single person, that patient, of interest. The subject-specific measure then best reflects the risks or benefits for that patient. Gail has noted this previously [ENAR Presidential Invited Address, April 1990], arguing that one goal of a clinical trial ought to be to predict the direction and size of a treatment benefit for a patient with specific covariate values. In contrast, population–averaged estimates of treatment effect compare outcomes in groups of patients. The groups being compared are determined by whatever covariates are included in the model. The treatment effect is then a comparison of average outcomes, where the averaging is over all omitted covariates.
Hauck et al. (1998)

K

13.5 How Many Covariables to Use?

  • Try to adjust for the bulk of the variation in outcome (Hauck et al. (1998),Tukey (1993))
  • Neuhaus (1998): “to improve the efficiency of estimated covariate effects of interest, analysts of randomized clinical trial data should adjust for covariates that are strongly associated with the outcome”
  • Raab et al. (2004) have more guidance for choosing covariables and provide a formula for linear model that shows how the value of adding a covariable depends on the sample size
L

13.6 Differential and Absolute Treatment Effects

13.6.1 Modeling Differential Treatment Effect

Differential treatment effect is often called heterogeneity of treatment effect or HTE. As opposed to the natural expansion of absolute treatment effect with underlying subject risk, differential treatment effect is usually based on analyses of relative effects, especially when the outcome is binary.

The most common approach to analyzing differential treatment effect involves searching for such effects rather than estimating the differential effect. This is, tragically, most often done through subgroup analysis.

Problems With Subgroup Analysis

M
  • Subgroup analysis is widely practiced and widely derided in RCTs (as well as in observational studies)
  • Construction of a subgroup from underlying factors that are continuous in nature (e.g., “older” = age \(\geq 65\)) assumes that the treatment effect is like falling off a cliff, i.e., all-or-nothing. Discontinuous treatment effects, like discontinuous main effects, have not been found and validated but have always been shown to have been an artificial construct driven by opinion and not data.
  • Given a subgroup a simple label such as “class IV heart failure” may seem to be meaningful but subgrouping carries along other subject characteristics that are correlated with the subgrouping variable. So the subgroup’s treatment effect estimate has a more complex interpretation than what is thought.
  • Researchers who don’t understand that “absence of evidence is not evidence for absence” interpret a significant effect in one subgroup but not in another as the treatment being effective in the first but ineffective in the second. The second \(P\)-value may be large because of increased variability in the second subgroup or because of a smaller effective sample size in that group.
  • Testing the treatment effect in a subgroup does not carry along with it the covariate adjustment needed and assumes there is no residual HTE within the subgroup.
  • When treatment interacts with factors not used in forming the subgroup, or when the subgrouping variable interacts with an ignored patient characteristic, the subgroup treatment effect may be grossly misleading. As an example, in GUSTO-I there was a strong interaction between Killip class and age. Unadjusted analysis of treatment effect within older subjects was partially an analysis of Killip class. And the treatment effect was not “significant” in older patients, leading many readers to conclude \(t\)–PA should not be given to them. In fact there was no evidence that age interacted with treatment, and the absolute risk reduction due to \(t\)–PA increased with age.

Specifying Interactions

N

  • Assessing differential treatment effect best done with formal interaction tests rather than subgroup analysis
  • Pre–specify sensible effect modifiers
    • interactions between treatment and extent of disease
    • “learned” interventions: interaction between treatment and duration of use by physician
  • Interactions with center are not necessarily sensible
  • Need to use penalized estimation (e.g., interaction effects as random effects) to get sufficient precision of differential treatment effects, if # interaction d.f. \(> 4\) for example Sargent & Hodges (1996) and Yamaguchi & Ohashi (1999)

O

Figure 13.2: A display of an interaction between treatment, extent of disease, and calendar year of start of treatment (Califf et al. (1989))

Strategy for Analyzing Differential Treatment Effect

The broad strategy that is recommended is based on the following:

P
  • Anything resembling subgroup analysis should be avoided
  • Anything that assumes that the treatment effect has a discontinuity in a continuous variable should be avoided. Differential effects should be smooth dose-response effects.
  • Anything that results in a finding that has a simpler alternate explanation should be avoided
    • Specify models so that an apparent interaction is not just a stand-in for an omitted main effect

Particulars of the strategy are:

Q
  • Formulate a subject-matter driven covariate model, including all covariates understood to have strong effects on the outcome. Ensure that these covariates are adjusted for in every HTE analysis context.
  • Main effects need to be flexibly modeled (e.g., using regression splines) so as to not assume linearity. False linearity assumptions can inflate apparent interaction effects because the interaction may be co-linear with omitted nonlinear main effects.
  • If the sample size allows, also model interaction effects as smooth nonlinear functions. As with main effects, it is not uncommon for interaction effects to be nonlinear, e.g., the effect of treatment is small for age \(<70\) then starts to expand rapidly after age 701. As a compromise, force interactions to be linear even if main effects are not, if the effective sample size does not permit estimating parameters for nonlinear differential effects.
  • Consider effect modifiers that are somewhat uncorrelated with each other. Add the main effects corresponding to each potential effect modifier into the model described in the previous point. For example if the primary analysis uses a model containing age, sex, and severity of disease and one is interested in assessing HTE by race and by geographical region, add both race and region to the main effects that are adjusted for in all later models. This will result in proper covariate adjustment and will handle the case where an apparent interaction is partially explained by an omitted main effect that is co-linear with one of the interacting factors.
  • Carry out a joint (chunk) likelihood ratio or \(F\)-test for all potential interaction effects combined. This test has a perfect multiplicity adjustment and will not be “brought down” by the potential interactions being co-linear with each other. The \(P\)-value from this joint test with multiple degrees of freedom will place the tests described in the next step in context. But note that with many d.f. the test may lack power.
  • Consider each potential interaction one-at-a-time by adding that interaction term to the comprehensive main effects model. In the example given above, main effects simultaneously adjusted for would include treatment, age, sex, severity of disease, race, and region. Then treatment \(\times\) race is added to the model and tested. Then the race interaction term (but not its main effect) is removed from the model and is replaced by region interactions.
  • 1 This does not imply that age should be modeled as a categorical variable to estimate the interaction; that would result in unexplained HTE in the age \(\geq 70\) interval.

  • R

    Concerning the last step, we must really recognize that there are two purposes for the analysis of differential treatment effect:

    S
    1. Understanding which subject characteristics are associated with alterations in efficacy
    2. Predicting the efficacy for an individual subject For the latter purpose, one might include all potential treatment interactions in the statistical model, whether interactions for multiple baseline variables are co-linear or not. Such co-linearities do not hurt predictions or confidence intervals for predictions. For the former purpose, interactions can make one effect modifier compete with another, and neither modifier will be significant when adjusted for the other. As an example, suppose that the model had main effects for treatment, age, sex, and systolic and diastolic blood pressure (SBP, DBP) and we entertained effect modification due to one or both of SBP, DBP. Because of the correlation between SBP and DBP, the main effects for these variables are co-linear and their statistical significance is weakened because they estimate, for example, the effect of increasing SBP holding DBP constant (which is difficult to do). Likewise, treatment \(\times\) SBP and treatment \(\times\) DBP interactions are co-linear with each other, making statistical tests for interaction have low power. We may find that neither SBP nor DBP interaction is “significant” after adjusting for the other, but that the combined chunk test for SBP or DBP interaction is highly significant. Someone who does not perform the chunk test may falsely conclude that SBP does not modify the treatment effect. This problem is more acute when more than two factors are allowed to interact with treatment.

    Note that including interaction terms in the model makes all treatment effect estimates conditional on specific covariate values, effectively lowering the sample size for each treatment effect estimate. When there is no HTE, the overall treatment main effect without having interaction terms in the model is by far the highest precision estimate. There is a bias-variance tradeoff when considering HTE. Adding interaction terms lowers bias but greatly increases variance.

    Another way to state the HTE estimation strategy is as follows.

    T
    1. Pick a model for which it is mathematically possible that there be no interactions (no restrictions on model parameters)
    2. Develop a model with ample flexibly-modeled main effects and clinically pre-specified interactions. Use a Bayesian skeptical prior or penalized maximum likelihood estimate to shrink the interaction terms if the effective sample size does not support the pre-specified number of interaction parameters. Be sure to model interaction effects as continuous if they come from variables of an underlying continuous nature.
    3. Display and explain relative effects (e.g., odds ratios) as a function of interacting factors and link that to what is known about the mechanism of treatment effect.
    4. Put all this in a context (like Figure Figure 13.7) that shows absolute treatment benefit (e.g., risk scale) as a function of background risk and of interacting factors.
    5. State clearly that background risk comes from all non-treatment factors and is a risk-difference accelerator that would increase risk differences for any risk factor, not just treatment. Possibly provide a risk calculator to estimate background risk to plug into the \(x\)-axis.

    Consider simulated two-treatment clinical trial data to illustrate the “always adjust for all main effects but consider interactions one at a time” approach to analyzing and displaying evidence for differential treatment effec. After that we will illustrate a simultaneous interaction analysis. Simulate time-to-event data from an exponential distribution, and fit Cox proportional hazards models to estimate the interaction between age and treatment and between sex and treatment. The true age effect is simulated as linear in the log hazard and the treatment effect on the log relative hazard scale is proportional to how far above 60 years is a patient’s age, with no treatment benefit before age 60. This is a non-simple interaction that could be exactly modeled with a linear spline function, but assume that the analyst does not know the true form of the interaction so she allows for a more general smooth form using a restricted cubic spline function. The data are simulated so that there is no sex interaction.

    Code
    require(rms)
    options(prType='html')   # for cph print, anova
    set.seed(1)
    n <- 3000    # total of 3000 subjects
    age <- rnorm(n, 60, 12)
    label(age) <- 'Age'
    sex   <- factor(sample(c('Male', 'Female'), n, rep=TRUE))
    treat <- factor(sample(c('A', 'B'), n, rep=TRUE))
    cens  <- 15 * runif(n)     # censoring time
    h <- 0.02 * exp(0.04 * (age - 60) + 0.4 * (sex == 'Female') -
                    0.04 * (treat == 'B') * pmax(age - 60, 0))
    dt <- -log(runif(n)) / h
    label(dt) <- 'Time Until Death or Censoring'
    e <- ifelse(dt <= cens, 1, 0)
    dt <- pmin(dt, cens)
    units(dt) <- 'Year'
    dd <- datadist(age, sex, treat); options(datadist='dd')
    S <- Surv(dt, e)
    f <- cph(S ~ sex + rcs(age, 4) * treat)
    f

    Cox Proportional Hazards Model

     cph(formula = S ~ sex + rcs(age, 4) * treat)
     
    Model Tests Discrimination
    Indexes
    Obs 3000 LR χ2 130.78 R2 0.047
    Events 470 d.f. 8 R28,3000 0.040
    Center 2.3448 Pr(>χ2) 0.0000 R28,470 0.230
    Score χ2 156.15 Dxy 0.261
    Pr(>χ2) 0.0000
    β S.E. Wald Z Pr(>|Z|)
    sex=Male  -0.3628  0.0936 -3.88 0.0001
    age   0.0423  0.0283 1.50 0.1349
    age’   0.0132  0.0655 0.20 0.8404
    age’’  -0.0456  0.2499 -0.18 0.8553
    treat=B   1.6377  1.6902 0.97 0.3326
    age × treat=B  -0.0335  0.0357 -0.94 0.3479
    age’ × treat=B   0.0643  0.0892 0.72 0.4709
    age’’ × treat=B  -0.4048  0.3606 -1.12 0.2616
    Code
    anova(f)
    Wald Statistics for S
    χ2 d.f. P
    sex 15.03 1 0.0001
    age (Factor+Higher Order Factors) 102.95 6 <0.0001
    All Interactions 19.92 3 0.0002
    Nonlinear (Factor+Higher Order Factors) 6.04 4 0.1961
    treat (Factor+Higher Order Factors) 27.80 4 <0.0001
    All Interactions 19.92 3 0.0002
    age × treat (Factor+Higher Order Factors) 19.92 3 0.0002
    Nonlinear 4.11 2 0.1284
    Nonlinear Interaction : f(A,B) vs. AB 4.11 2 0.1284
    TOTAL NONLINEAR 6.04 4 0.1961
    TOTAL NONLINEAR + INTERACTION 20.80 5 0.0009
    TOTAL 138.91 8 <0.0001

    The model fitted above allows for a general age \(\times\) treatment interaction. Let’s explore this interaction by plotting the age effect separately by treatment group, then plotting the treatment B:A hazard ratio as a function of age.

    Code
    ggplot(Predict(f, age, treat), rdata=data.frame(age, treat))
    ages <- seq(30, 87, length=200)
    k <- contrast(f, list(treat='B', age=ages), list(treat='A', age=ages))
    k <- as.data.frame(k[Cs(sex,age,Contrast,Lower,Upper)])
    ggplot(k, aes(x=age, y=exp(Contrast))) +
      scale_y_log10(minor_breaks=seq(.2, .9, by=.1)) +
      geom_ribbon(aes(ymin=exp(Lower), ymax=exp(Upper)), fill='gray80') +
      geom_line() +
      ylab('B:A Hazard Ratio') + xlab('Age')

    (a) Age effect separately by treatment

    (b) B:A hazard ratio vs. age

    Figure 13.3: Estimates from model with age \(\times\) treatment interaction

    Re-fit the model allowing for a sex interaction (but not an age interaction) and display the results.

    Code
    g <- cph(S ~ sex * treat + rcs(age, 4))
    g

    Cox Proportional Hazards Model

     cph(formula = S ~ sex * treat + rcs(age, 4))
     
    Model Tests Discrimination
    Indexes
    Obs 3000 LR χ2 108.40 R2 0.039
    Events 470 d.f. 6 R26,3000 0.034
    Center 1.2565 Pr(>χ2) 0.0000 R26,470 0.196
    Score χ2 110.68 Dxy 0.249
    Pr(>χ2) 0.0000
    β S.E. Wald Z Pr(>|Z|)
    sex=Male  -0.3494  0.1234 -2.83 0.0046
    treat=B  -0.2748  0.1222 -2.25 0.0246
    age   0.0228  0.0173 1.31 0.1890
    age’   0.0452  0.0435 1.04 0.2981
    age’’  -0.2088  0.1739 -1.20 0.2301
    sex=Male × treat=B  -0.0526  0.1894 -0.28 0.7813
    Code
    anova(g)
    Wald Statistics for S
    χ2 d.f. P
    sex (Factor+Higher Order Factors) 15.87 2 0.0004
    All Interactions 0.08 1 0.7813
    treat (Factor+Higher Order Factors) 10.18 2 0.0062
    All Interactions 0.08 1 0.7813
    age 78.81 3 <0.0001
    Nonlinear 1.85 2 0.3966
    sex × treat (Factor+Higher Order Factors) 0.08 1 0.7813
    TOTAL NONLINEAR + INTERACTION 1.92 3 0.5888
    TOTAL 104.05 6 <0.0001
    Code
    k <- contrast(g, list(treat='B', sex=levels(sex)), list(treat='A', sex=levels(sex)))
    k <- as.data.frame(k[Cs(sex,age,Contrast,Lower,Upper)])
    kabl(k)
    sex age Contrast Lower Upper
    Female 59.7296 -0.2748 -0.5144 -0.0352
    Male 59.7296 -0.3274 -0.6108 -0.0439
    Code
    ggplot(Predict(g, sex, treat))
    ggplot(k, aes(y=exp(Contrast), x=sex)) + geom_point() +
      scale_y_log10(breaks=c(.5, .6, .7, .8, .9, 1, 1.1)) +
      geom_linerange(aes(ymin=exp(Lower), ymax=exp(Upper))) +
      xlab('Sex') + ylab('B:A Hazard Ratio') + coord_flip()

    (a) Partial effects plot

    (b) B:A hazard ratio by sex

    Figure 13.4: Effects of predictors in Cox model

    The above analysis signified a treatment effect for both sex groups (both confidence limits exclude a hazard ratio of 1.0), whereas this should only be the case when age exceeds 60. No sex \(\times\) treatment interaction was indicated. Re-do the previous analysis adjusting for an age \(\times\) treatment interaction while estimating the sex \(\times\) treatment interaction. For this purpose we must specify an age value when estimating the treatment effects by sex. Here we set age to 50 years. Were age and sex correlated, the joint analysis would have been harder to interpret.

    Code
    h <- cph(S ~ treat * (rcs(age, 4) + sex))
    anova(h)
    Wald Statistics for S
    χ2 d.f. P
    treat (Factor+Higher Order Factors) 27.85 5 <0.0001
    All Interactions 19.98 4 0.0005
    age (Factor+Higher Order Factors) 103.01 6 <0.0001
    All Interactions 19.91 3 0.0002
    Nonlinear (Factor+Higher Order Factors) 6.05 4 0.1956
    sex (Factor+Higher Order Factors) 15.08 2 0.0005
    All Interactions 0.06 1 0.8096
    treat × age (Factor+Higher Order Factors) 19.91 3 0.0002
    Nonlinear 4.10 2 0.1286
    Nonlinear Interaction : f(A,B) vs. AB 4.10 2 0.1286
    treat × sex (Factor+Higher Order Factors) 0.06 1 0.8096
    TOTAL NONLINEAR 6.05 4 0.1956
    TOTAL INTERACTION 19.98 4 0.0005
    TOTAL NONLINEAR + INTERACTION 20.86 6 0.0019
    TOTAL 138.58 9 <0.0001
    Code
    ggplot(Predict(h, sex, treat, age=50))
    k <- contrast(h, list(treat='B', sex=levels(sex), age=50),
                     list(treat='A', sex=levels(sex), age=50))
    k <- as.data.frame(k[Cs(sex,age,Contrast,Lower,Upper)])
    ggplot(k, aes(y=exp(Contrast), x=sex)) + geom_point() +
      scale_y_log10(breaks=c(.5, .6, .7, .8, .9, 1, 1.1)) +
      geom_linerange(aes(ymin=exp(Lower), ymax=exp(Upper))) +
      xlab('Sex') + ylab('B:A Hazard Ratio') + coord_flip()

    (a) Partial effects plot

    (b) B:A hazard ratio by sex age age=50

    Figure 13.5: Estimates from a Cox model allowing treatment to interact with both age and sex

    This is a more accurate representation of the true underlying model, and is easy to interpret because (1) age and sex are uncorrelated in the simulation model used, and (2) only two interactions were considered.

    Absolute vs. Relative Treatment Effects Revisited

    Statistical models are typically chosen so as to maximize the likelihood of the model fitting the data and processes that generated them. Even though one may be interested in absolute effects, models must usually be based on relative effects so that quantities such as estimated risk are in the legal \([0,1]\) range. It is not unreasonable to think of a relative effects model such as the logistic model as providing supporting evidence that an interacting effect causes a change in the treatment effect, whereas the necessary expansion of treatment effect for higher risk subjects, as detailed in the next section, is merely a mathematical necessity for how probabilities work. One could perhaps say that any factor that confers increased risk for a subject (up to the point of diminishing returns; see below) might cause an increase in the treatment effect, but this is a general phenomenon that is spread throughout all risk factors independently of how they may directly affect the treatment effect. Because of this, additive risk models are not recommended for modeling outcomes or estimating differential treatment effects on an absolute scale. This logic leads to the following recommended strategy:

    U
    1. Base all inference and predicted risk estimation on a model most likely to fit the data (e.g., logistic risk model; Cox model). Choose the model so that it is possible that there may be no interactions, i.e., a model that does not place a restriction on regression coefficients.
    2. Pre-specify sensible possible interactions as described earlier
    3. Use estimates from this model to estimate relative differential treatment effects, which should be smooth functions of continuous baseline variables
    4. Use the same model to estimate absolute risk differences (as in the next section) as a function both of interacting factors and of baseline variables in the model that do not interact with treatment (but will still expand the absolute treatment effect)
    5. Even better: since expansion of risk differences is a general function of overall risk and not of individual risk factors only display effects of individual risk factors that interact with treatment (interacting on the scale that would have allowed them not to interact—the relative scale such as log odds) and show the general risk difference expansion as in Figures Figure 13.6 (think of the “risk factor” as treatment) and Figure 13.9

    To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. ‘Additive at the point of analysis but relevant at the point of application’ should be the motto.
    Stephen Senn

    13.6.2 Estimating Absolute Treatment Effects

    • Absolute efficacy measures:
      • Risk difference (\(\delta\))
      • number needed to treat (reciprocal of risk difference)
      • Years of life saved
      • Quality–adjusted life years saved
    • Binary response, no interactions with treatment, risk for control patient \(P\):
      \(\delta = P - \frac{P}{P + (1-P)/OR}\)
    • \(\delta\) is dominated by \(P\)
    VW
    Code
    plot(0, 0, type="n", xlab="Risk for Subject Without Risk Factor",
         ylab="Increase in Risk",
         xlim=c(0,1), ylim=c(0,.6))
    i <- 0
    or <- c(1.1,1.25,1.5,1.75,2,3,4,5,10)
    for(h in or) {
      i <- i + 1
      p <- seq(.0001, .9999, length=200)
      logit <- log(p/(1 - p))  # same as qlogis(p)
      logit <- logit + log(h)  # modify by odds ratio
      p2 <- 1/(1 + exp(-logit))# same as plogis(logit)
      d <- p2 - p
      lines(p, d, lty=i)
      maxd <- max(d)
      smax <- p[d==maxd]
      text(smax, maxd + .02, format(h), cex=.6)
    }

    Figure 13.6: Absolute risk increase as a function of risk for control subject. Numbers on curves are treatment:control odds ratios.

    If the outcome is such that \(Y=1\) implies a good outcome, Figure Figure 13.6 would be useful for estimating the absolute risk increase for a “good” treatment by selecting the one curve according to the odds ratio the treatment achieved in a multivariable risk model. This assumes that the treatment does not interact with any patient characteristic(s).

    van Klaveren et al. (2015) described the importance of correctly modeling interactions when estimating absolute treatment benefit.

    Now consider the case where \(Y=1\) is a bad outcome and \(Y=0\) is a good outcome, and there is differential relative treatment effect according to a truly binary patient characteristic \(X=0,1\). Suppose that treatment represents a new agent and the control group is patients on standard therapy. Suppose that the new treatment multiplies the odds of a bad outcome by 0.8 when \(X=0\) and by 0.6 when \(X=1\), and that the background risk that \(Y=1\) for patients on standard therapy ranges from 0.01 to 0.99. The background risk could come from one or more continuous variables or mixtures of continuous and categorical patient characteristics. The “main effect” of \(X\) must also be specified. We assume that \(X=0\) goes into the background risk and \(X=1\) increases the odds that \(Y=1\) by a factor of 1.4 for patients on standard therapy. All of this specifies a full probability model that can be evaluated to show the absolute risk reduction by the new treatment as a function of background risk and \(X\).

    X
    Code
    require(Hmisc)
    d <- expand.grid(X=0:1, brisk=seq(0.01, 0.99, length=150))
    d <- upData(d,
                risk.standard = plogis(qlogis(brisk) + log(1.4) * X),
                risk.new      = plogis(qlogis(brisk) + log(1.4) * X +
                                         log(0.8) * (X == 0) +
                                         log(0.6) * (X == 1)),
                risk.diff     = risk.standard - risk.new,
                X = factor(X) )
    Input object size:   19040 bytes;    2 variables     300 observations
    Added variable      risk.standard
    Added variable      risk.new
    Added variable      risk.diff
    Modified variable   X
    New object size:    27152 bytes;    5 variables 300 observations
    Code
    ggplot(d, aes(x=risk.standard, y=risk.diff, color=X)) +
      geom_line() +
      xlab('Risk Under Standard Treatment') +
      ylab('Absolute Risk Reduction With New Treatment')   # (* `r ipacue()` *)

    Figure 13.7: Absolute risk reduction by a new treatment as a function of background risk and an interacting factor

    It is important to note that the magnification of absolute risk reduction by increasing background risk should not be labeled (or analyzed) by any one contributing risk factor. This is a generalized effect that comes solely from the restriction of probabilities to the \([0,1]\) range.

    Absolute Treatment Effects for GUSTO–I

    Y

    • No evidence for interactions with treatment
    • Misleading subgroup analysis showed that elderly patients not benefit from \(t\)–PA; result of strong age \(\times\) Killip class interaction
    • Wide variation in absolute benefit of \(t\)–PA

    Z
    Code
    delta <- with(gustomin, p.sk - p.tpa)
    plot(density(delta), xlab='Mortality Difference',
         ylab='Probability Density', main='')
    m <- mean(delta)
    u <- par("usr")
    arrows(m, u[3], m, 0, length=.1, lwd=2)
    text(m, 2, 'Mean', srt=45, adj=0)
    med <- median(delta)
    arrows(med, u[3], med, 0, length=.1, lwd=2)
    text(med, 2, 'Median', srt=45, adj=0)

    Figure 13.8: Distribution of absolute risk reduction with \(t\)–PA vs. SK

    • Overall mortality difference of 0.011 dominated by high–risk patients
    A
    Code
    load('gusto.rda')
    require(rms)
    dd <- datadist(gusto); options(datadist='dd')
    f <- lrm(day30 ~ tx + age * Killip + pmin(sysbp, 120) +
               lsp(pulse, 50) + pmi + miloc, data=gusto)
    cat('<br><small>')


    Code
    f

    Logistic Regression Model

     lrm(formula = day30 ~ tx + age * Killip + pmin(sysbp, 120) + 
         lsp(pulse, 50) + pmi + miloc, data = gusto)
     
    Model Likelihood
    Ratio Test
    Discrimination
    Indexes
    Rank Discrim.
    Indexes
    Obs 40830 LR χ2 4173.41 R2 0.245 C 0.821
    0 37979 d.f. 15 R215,40830 0.097 Dxy 0.642
    1 2851 Pr(>χ2) <0.0001 R215,7955.8 0.407 γ 0.642
    max |∂log L/∂β| 5×10-6 Brier 0.055 τa 0.083
    β S.E. Wald Z Pr(>|Z|)
    Intercept  -3.9541  0.7135 -5.54 <0.0001
    tx=SK   0.0738  0.0512 1.44 0.1499
    tx=tPA  -0.1338  0.0608 -2.20 0.0276
    age   0.0867  0.0026 33.67 <0.0001
    Killip=II   2.1146  0.3610 5.86 <0.0001
    Killip=III   3.7596  0.7310 5.14 <0.0001
    Killip=IV   4.0790  0.8259 4.94 <0.0001
    sysbp  -0.0386  0.0017 -23.10 <0.0001
    pulse  -0.0221  0.0141 -1.57 0.1168
    pulse’   0.0416  0.0143 2.90 0.0037
    pmi=yes   0.4664  0.0485 9.62 <0.0001
    miloc=Other   0.3048  0.1163 2.62 0.0087
    miloc=Anterior   0.5370  0.0443 12.12 <0.0001
    age × Killip=II  -0.0216  0.0051 -4.22 <0.0001
    age × Killip=III  -0.0363  0.0103 -3.51 0.0004
    age × Killip=IV  -0.0323  0.0124 -2.61 0.0090
    Code
    anova(f)
    Wald Statistics for day30
    χ2 d.f. P
    tx 15.46 2 0.0004
    age (Factor+Higher Order Factors) 1390.07 4 <0.0001
    All Interactions 31.13 3 <0.0001
    Killip (Factor+Higher Order Factors) 427.94 6 <0.0001
    All Interactions 31.13 3 <0.0001
    sysbp 533.64 1 <0.0001
    pulse 325.19 2 <0.0001
    Nonlinear 8.43 1 0.0037
    pmi 92.55 1 <0.0001
    miloc 146.92 2 <0.0001
    age × Killip (Factor+Higher Order Factors) 31.13 3 <0.0001
    TOTAL NONLINEAR + INTERACTION 39.17 4 <0.0001
    TOTAL 3167.41 15 <0.0001
    Code
    cat('</small>')

    Code
    cof <- coef(f)             # vector of regression coefficients
    # For cof, X*beta without treatment coefficients estimates logit
    # for SK+t-PA combination therapy (reference cell).  The coefficient for
    # SK estimates the difference in logits from combo to SK.  The coefficient
    # for tPA estimates the difference in tPA from combo.  The mortality
    # difference of interest is mortality with SK minus mortality with tPA.
    mort.sk   <- function(x) plogis(x + cof['tx=SK'])
    mort.diff <- function(x)
      ifelse(x < 0, mort.sk(x) - plogis(x + cof['tx=tPA']), NA)
    # only define when logit < 0 since U-shaped
    n <- nomogram(f, fun=list(mort.sk, mort.diff),
                  funlabel=c("30-Day Mortality\nFor SK Treatment",
                             "Mortality Reduction by t-PA"),
                  fun.at=list(c(.001,.005,.01,.05,.1,.2,.5,.7,.9),
                    c(.001,.005,.01,.02,.03,.04,.05)),
                  pulse=seq(0,260,by=10), omit='tx', lp=FALSE)
    Code
    plot(n, varname.label.sep=' ', xfrac=.27, lmgp=.2, cex.axis=.6)

    Figure 13.9: Nomogram to predict SK - \(t\)–PA mortality difference, based on the difference between two binary logistic models.

    Absolute Benefit on Survival Prob.

    B

    • Cox PH model
    • Modeling can uncover time course of treatment effect
    • \(X_1\) = treatment, \(A={X_{2},\ldots,X_{p}}\) adjustment variables
    • Survival difference is
      \(S(t | X_{1}=1, A) - S(t | X_{1}=0, A) = S(t | X_{1}=0, A)^{HR} - S(t | X_{1}=0, A)\)

    C
    Code
    plot(0, 0, type="n", xlab="Survival for Control Subject",
         ylab="Improvement in Survival",
         xlim=c(0,1), ylim=c(0,.7))
    i <- 0
    hr <- seq(.1, .9, by=.1)
    for(h in hr) {
      i <- i + 1
      p <- seq(.0001, .9999, length=200)
      p2 <- p ^ h
      d <- p2 - p
      lines(p, d, lty=i)
      maxd <- max(d)
      smax <- p[d==maxd]
      text(smax,maxd+.02, format(h), cex=.6)
    }

    Figure 13.10: Relationship between baseline risk, relative treatment effect (hazard ratio — numbers above curves) and absolute treatment effect.

    • See also Kent & Hayward (2007)

    13.7 Cost–Effectiveness Ratios

    • Effectiveness E (denominator of C–E ratio) is always absolute
    • Absolute treatment effectiveness varies greatly with patient characteristics
    • ⇒ C–E ratio varies greatly
    • A C–E ratio based on average E and average C may not apply to any existing patient!
    • Need a model to estimate E
    • C may also depend on patient characteristics
    D

    E
    Code
    cost.life <- 2400 / delta / 1e6
    plot(density(cost.life), xlab='Cost Per Life Saved, $M', main='',
         ylab='Probability Density', xlim=c(0, 6))
    m <- 2400 / mean(delta) / 1e6
    u <- par("usr")
    arrows(m, u[3], m, 0, length=.1, lwd=2)
    text(m,.01,'Cost using\n  average\n    reduction',srt=45,adj=0)

    Figure 13.11: Distribution of cost per life saved in GUSTO–I

    13.8 Treatment Contrasts for Multi–Site Randomized Trials

    • Primary model: covariables, treatment, site main effects

    • Planned secondary model to assess consistency of treatment effects over sites (add site \(\times\) treatment interactions)

    • Advantages for considering sites as random effects (or use penalized MLE to shrink site effects, especially for small sites). See Andersen et al. (1999) for a random effects Cox model and a demonstration that treatment effects may be inconsistent when non–zero site main effects are ignored in the Cox model. See also Yamaguchi & Ohashi (1999).

    • Types of tests / contrasts when interactions are included (Schwemer (2000)):

      • Type I: not adjusted for center
      • Type II: average treatment effect, weighted by size of sites
        R rms package code:
      sites <- levels(site)
      contrast(fit, list(treat='b', site=sites),
                  list(treat='a', site=sites),
             type='average', weights=table(site))
      • Type III: average treatment effect, unweighted
      contrast(fit, list(treat='b', site=sites),
                  list(treat='a', site=sites), type='average')

      Low precision; studies are not powered for Type III tests.

    • Another interesting test: combined effects of treatment and site \(\times\) treatment interaction; tests whether treatment was effective at any site.

    FG

    13.9 Statistical Plan for Randomized Trials

    • FDA draft guidance - advocates for many forms of covariate adjustment
    • When a relevant dataset is available before the trial begins, develop the model from the dataset and use the predicted value as a single adjustment covariable in the trial (Knaus et al. (1993))
    • Otherwise: CPMP Working Party: Finalize choice of model, transformations, interactions before merging treatment assignment into analysis dataset.
      Edwards (1999): Pre–specify family of models that will be used, along with the strategy for selecting the particular model.
      Masked model derivation does not bias treatment effect.
    • CPMP guidance (Committee for Proprietary Medicinal Products, 2004)
      • ``Stratification may be used to ensure balance of treatments across covariates; it may also be used for administrative reasons. The factors that are the basis of stratification should normally be included as covariates in the primary model.
      • Variables known a priori to be strongly, or at least moderately, associated with the primary outcome and/or variables for which there is a strong clinical rationale for such an association should also be considered as covariates in the primary analysis. The variables selected on this basis should be pre-specified in the protocol or the statistical analysis plan.
      • Baseline imbalance observed post hoc should not be considered an appropriate reason for including a variable as a covariate in the primary analysis.
      • Variables measured after randomization and so potentially affected by the treatment should not normally be included as covariates in the primary analysis.
      • If a baseline value of a continuous outcome measure is available, then this should usually be included as a covariate. This applies whether the primary outcome variable is defined as the ‘raw outcome’ or as the ‘change from baseline’.
      • Only a few covariates should be included in a primary analysis. Although larger data sets may support more covariates than smaller ones, justification for including each of the covariates should be provided. (???)
      • In the absence of prior knowledge, a simple functional form (usually either linearity or dichotomising a continuous scale) should be assumed for the relationship between a continuous covariate and the outcome variable. (???)
      • The validity of the model assumptions must be checked when assessing the results. This is particularly important for generalized linear or non-linear models where mis-specification could lead to incorrect estimates of the treatment effect. Even under ordinary linear models, some attention should be paid to the possible influence of extreme outlying values.
      • Whenever adjusted analyses are presented, results of the treatment effect in subgroups formed by the covariates (appropriately categorised, if relevant) should be presented to enable an assessment of the validity of the model assumptions. (???)
      • Sensitivity analyses should be pre-planned and presented to investigate the robustness of the primary results. Discrepancies should be discussed and explained. In the presence of important differences that cannot be logically explained-for example, between the results of adjusted and unadjusted analyses-the interpretation of the trial could be seriously affected.
      • The primary model should not include treatment by covariate interactions. If substantial interactions are expected a priori, the trial should be designed to allow separate estimates of the treatment effects in specific subgroups.
      • Exploratory analyses may be carried out to improve the understanding of covariates not included in the primary analysis, and to help the sponsor with the ongoing development of the drug.
      • A primary analysis, unambiguously pre-specified in the protocol or statistical analysis plan, correctly carried out and interpreted, should support the conclusions which are drawn from the trial. Since there may be a number of alternative valid analyses, results based on pre-specified analyses will carry most credibility.’’
    H

    In confirmatory trials, a model is pre-specified, and it is necessary to pretend that it is true. In most other statistical applications, the choice of model is data–driven, but it is necessary to pretend that it is not.
    Edwards (1999); see also Sigueira & Taylor (1999)

    • Choose predictors based on expert opinion
    • Impute missing values rather than discarding observations
    • Keep all pre–specified predictors in model, regardless of \(P\)–value
    • Use shrinkage (penalized maximum likelihood estimation) to avoid over–adjustment
    • Some guidance for handling missing baseline data in RCTs is in White & Thompson (2005)
    I

    13.9.1 Sites vs. Covariables

    • Site effects (main or interaction) are almost always trivial in comparison with patient-specific covariable effects
    • It is not sensible to include site in the model when important covariables are omitted
    • The most logical and usually the most strong interaction with treatment is not site but is the severity of disease being treated
    J

    13.9.2 Covariable Adjustment vs. Allocation Based on Covariates

    K

    The decision to fit prognostic factors has a far more dramatic effect on the precision of our inferences than the choice of an allocation based on covariates or randomization approach and one of my chief objections to the allocation based on covariates approach is that trialists have tended to use the fact that they have balanced as an excuse for not fitting. This is a grave mistake.
    Senn (2004) p. 3748; see also Senn et al. (2010)

    My view .. was that the form of analysis envisaged (that is to say, which factors and covariates should be fitted) justified the allocation and not vice versa.
    Senn (2004), p. 3747

    13.10 Robustness

    Regression models relate some transformation, called the link function, of \(Y\) to the linear predictor \(X\beta\). For multiple linear regression the link function is the identity function. For binary \(Y\) the most popular link function is the log odds (logit) but other links are possible. White et al. (2021) have shown that when a standard link function (e.g., logit for binary \(Y\) and the identity function for continuous \(Y\)) is used, analysis of covariance is robust to model misspecification. See here for simulations supporting this for binary logistic regression. For a proportional odds ordinal logistic model, a simulated example shows robustness even in the face of adjusting for a very important covariate that causes an extreme violation of the proportional odds assumption. The effect of such a violation is to effectively ignore the covariate, making the adjusted analysis virtually the same as the unadjusted analysis if no other covariates are present.

    The primary reason for the result of White et al. (2021) is their demonstration that under an appropriate link function (e.g., one that does not artificially constrain the parameters as an additive risk model does) the adjusted treatment effect is zero if and only if the unadjusted treatment effect is zero.

    In general, violation of model assumptions about treatment effect parameters do not affect type I assertion probability \(\alpha\). That is because under the null hypothesis that the treatment is ignorable, there are no assumptions to be made about the form of treatment effects. For example in a proportional odds ordinal logistic model, under \(H_0\) the treatment effect must operate in a proportional odds fashion (i.e., all odds ratios involving treatment are 1.0 so they are all equal). It is impossible to violate the proportional odds assumption under the null.

    An example of poor model fitting where type I assertion probability \(\alpha\) for testing treatment the difference was not preserved is in the analysis of longitudinal ordinal response under a first-order Markov process when there is a strong time effect that is mis-modeled. If the mixture of outcome states changes arbitrarily over time so that time operates in a non-proportional odds manner but proportional odds is falsely assumed, the ordinary standard error estimates for the treatment effect are too small and \(\alpha\) is significantly larger than the intended 0.05. Using the cluster sandwich covariance matrix estimator in place of the usual information matrix-based standard errors completely fixed that problem.

    But as shown in the linked report, a complete solution is had by using a partial proportional odds model to relax the proportional odds assumption with respect to absolute time.

    13.11 Summary

    The point of view is sometimes defended that analyses that ignore covariates are superior because they are simpler. I do not accept this. A value of \(\pi=3\) is a simple one and accurate to one significant figure … However very few would seriously maintain that if should generally be adopted by engineers.
    Senn (2004), p. 3741

    L
    • As opposed to simple treatment group comparisons, modeling can
      • Improve precision (linear, log–linear models)
      • Get the “right” treatment effect (nonlinear models)
      • Improve power (almost all models)
      • Uncover outcome patterns, shapes of effects
      • Test/estimate differential treatment benefit
      • Determine whether some patients are too sick or too well to benefit
      • Estimate absolute clinical benefit as a function of severity of illness
      • Estimate meaningful cost–effectiveness ratios
      • Help design the next clinical trial (optimize risk distribution for maximum power)
    • Modeling strategy must be well thought–out
      • Not “data mining”
      • Not done to optimize the treatment \(P\)–value
    M

    13.12 Notes

    I think it is most important to decide what it is you want to estimate, and then formulate a model that will accomplish that. Unlike ordinary linear models, which provide unbiased treatment effects if balanced covariates are mistakenly omitted from the model in an RCT, most models (such as the Cox PH model) result in biased treatment effects even when there is perfect balance in covariates, if the covariates have nonzero effects on the outcome. This is another way of talking about residual outcome heterogeneity.

    If you want to estimate the effect of variable \(X\) on survival time, averaging over males and females in some strange undocumented way, you can get the population averaged effect of \(X\) without including sex in the model. Recognize however this is like comparing some of the males with some of the females when estimating the \(X\) effect. This is seldom of interest. More likely we want to know the effect of \(X\) for males, the effect for females, and if there is no interaction we pool the two to more precisely estimate the effect of \(X\) conditional on sex. Another way to view this is that the PH assumption is more likely to hold when you condition on covariates than when you don’t. No matter what happens though, if PH holds for one case, it cannot hold for the other, e.g., if PH holds after conditioning, it cannot hold when just looking at the marginal effect of X.
    — From a posting by Harrell to the Medstats google group on 19Jan09

    See the excellent Tweetorial by Darren Dahly here and this supplement to it by Ben Andrew.

    This article by Stephen Senn is also very helpful.