Modernizing Clinical Trial Design and Analysis to Improve Efficiency & Flexibility

Seventh Annual Janice Pogue Lectureship in Biostatistics

Population Health Research Institute
2024-12-05

Background

Can You Experiment Until You Have An Answer?

I.e., can you act like a physicist?
Fully sequential trials are almost never used, and may be more useful than adaptive trials
What keeps a researcher from collecting data until there is sufficient evidence in one direction, or until she runs out of time, money, or patience?

The Answers

Traditional frequentist statistics
Fixed budgets
- Government grant project budgets vs. portfolio budgets
Wouldn’t it be more logical to
- direct more resources to promising studies?
- direct resources away from unpromising studies (high prob. of inefficacy or only trivial efficacy)?

What Would You Rather Know?

The probability of a positive result when there is nothing?
The probability that a positive result turns out to be nothing?
How surprising is our data if the treatment has no effect?
The probability of all possible levels of efficacy?

What Would You Rather Know? continued

If you like the first, you like the frequentist status quo
- p = P(data more impressive than ours if treatment is ignorable)
If you like the second, you have Bayesian tendencies & like posterior probabilities
- P(unknown effect > c | data, prior information)
- E.g. P(BP reduction > 0 mmHg),
  P(BP reduction > 5 mmHg),
  P(similarity) = P(change in [-3, 3] mmHg)
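
A minimal sketch of how posterior probabilities like these could be computed, assuming a normal prior and a normal likelihood for the mean BP reduction (a conjugate update). The prior, the observed estimate, and its standard error are hypothetical numbers chosen for illustration, not values from the lecture.

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior = 1.0 / prior_sd ** 2
    w_data = 1.0 / se ** 2
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return NormalDist(post_mean, post_var ** 0.5)

# Hypothetical inputs: prior centered at no effect (SD 10 mmHg),
# observed mean BP reduction of 6 mmHg with standard error 2 mmHg.
post = normal_posterior(prior_mean=0.0, prior_sd=10.0, estimate=6.0, se=2.0)

p_any_benefit = 1 - post.cdf(0)            # P(BP reduction > 0 mmHg)
p_gt_5 = 1 - post.cdf(5)                   # P(BP reduction > 5 mmHg)
p_similar = post.cdf(3) - post.cdf(-3)     # P(change in [-3, 3] mmHg)

print(f"P(reduction > 0) = {p_any_benefit:.3f}")
print(f"P(reduction > 5) = {p_gt_5:.3f}")
print(f"P(similarity)    = {p_similar:.3f}")
```

All three probabilities come from the same posterior distribution, so any number of such assertions can be evaluated without a multiplicity adjustment to the calculation itself.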

Power/Sample Size

What Are the Most Common Outcomes of Clinical Trials?

Insufficient patient recruitment
Equivocal result
- At planned sample size result not “significant”
- Uncertainty intervals too wide to learn anything about efficacy
- The money was spent

What Are the Most Common Causes of Equivocal Results?

Over-optimistic power calculation
- using an effect size > clinically relevant
- if effect is “only” clinically relevant, likely to miss it
Fixed sample size
Losing power in the belief that α is something that should be controlled when taking multiple data looks
Insensitive outcome measure

Another Common Outcome

Little evidence the treatment works
Usually the study could have been stopped much sooner

The Problem With Sample Size

Frequentist (traditional statistics): need a final sample size to know when α is “spent”
Fixed budgeting also requires a maximum sample size
Sample size calculations are voodoo
- arbitrary α, β, effect to detect Δ
- requires accurate σ or event incidence
Physics approach: experiment until you have the answer

Is a Sample Size Calculation Needed?

Yes for budget, if fixed
No sample size needed at all if using a Bayesian sequential design
With Bayes, study extension is trivial and requires no α penalty
- logical to recruit more patients if result is promising but not definitive
  0.7 < P(benefit) < 0.95
- analysis of cumulative data after new data added merely supersedes analysis of previous dataset

Is α a Good Thing to Control?

α = type I assertion probability = P(p < 0.05 | H₀) typically
It is not the probability of making an error in acting as if a treatment works
It is the probability of making an assertion of efficacy (rejecting H₀) when any assertion of efficacy would by definition be wrong (i.e., under H₀)
α ⇑ as # data looks ⇑
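
The growth of α with repeated looks can be illustrated by simulation. The sketch below assumes a one-sample z-test on N(0, 1) outcomes with known variance, equally spaced looks, and an unadjusted 1.96 cutoff applied at every look; the batch size and number of simulated trials are arbitrary choices.

```python
import random
from math import sqrt

def type1_assertion_prob(n_looks, n_per_look=20, n_sims=4000, z_crit=1.96, seed=1):
    """Fraction of null trials asserting an effect at any of n_looks unadjusted looks."""
    rng = random.Random(seed)
    assertions = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Sum of n_per_look N(0, 1) outcomes under H0 (true effect = 0)
            total += rng.gauss(0.0, sqrt(n_per_look))
            n += n_per_look
            if abs(total / sqrt(n)) > z_crit:  # two-sided z-test on cumulative data
                assertions += 1
                break
    return assertions / n_sims

alpha_by_looks = {k: type1_assertion_prob(k) for k in (1, 5, 20)}
for k, a in alpha_by_looks.items():
    print(f"{k:2d} looks: P(assertion | H0) ~ {a:.3f}")
```

With a single look the estimate should sit near the nominal 0.05, and it rises as looks are added, which is the behavior that frequentist sequential methods spend α to prevent.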

Is α a Good Thing to Control? continued

α includes the probability of things that might have happened
- even if we have very strong evidence of efficacy at a given time, our design had the possibility of showing an efficacy signal at other times had efficacy=0
- Bayesian methods deal with what did happen and not what might have happened

Is α a Good Thing to Control? continued

Asking one to compute α for a Bayesian design is like asking a poker player winning > $10M/year to justify his ranking by how seldom he places bets in games he didn’t win.

Is α a Good Thing to Control? continued

Controlling α leads to conservatism when there are multiple data looks
- Bayesian sequential designs: expected sample size at time of stopping study for efficacy/harm/inefficacy ⇓ as # looks ⇑
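
A rough simulation sketch of this claim, under assumptions that are not from the lecture: normal outcomes with known SD 1, a N(0, 1) prior on the mean effect, a true effect of 0.3, a maximum of 400 patients, and stopping when P(effect > 0 | data) leaves the interval [0.05, 0.95].

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def n_at_stopping(look_every, true_mean, max_n, rng, prior_sd=1.0, hi=0.95, lo=0.05):
    """Sample size at which one simulated trial stops."""
    total, n = 0.0, 0
    while n < max_n:
        # Add one batch: the sum of look_every N(true_mean, 1) outcomes
        total += rng.gauss(true_mean * look_every, look_every ** 0.5)
        n += look_every
        # Conjugate posterior for the mean (prior mean 0, known outcome SD 1)
        post_var = 1.0 / (1.0 / prior_sd ** 2 + n)
        post_mean = post_var * total
        p_benefit = STD_NORMAL.cdf(post_mean / post_var ** 0.5)  # P(effect > 0 | data)
        if p_benefit > hi or p_benefit < lo:
            break
    return n

def expected_n(look_every, true_mean=0.3, max_n=400, n_sims=2000, seed=3):
    rng = random.Random(seed)
    return mean([n_at_stopping(look_every, true_mean, max_n, rng) for _ in range(n_sims)])

mean_n = {k: expected_n(k) for k in (1, 10, 100)}
for k, m in mean_n.items():
    print(f"look every {k:3d} patients: mean n at stopping ~ {m:.1f}")
```

The posterior-probability thresholds are the same at every look; only the frequency of looking changes, and more frequent looks let the trial stop closer to the moment the evidence becomes sufficient.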

What Should We Control?

The probability of being wrong in acting as if a treatment works
This is one minus the Bayesian posterior probability of efficacy (probability of inefficacy or harm)
Controlled by the prior distribution (+ data, statistical model, outcome measure, sample size)
Example: P(HR < 1 | current data, prior) = 0.96
⇒ P(HR ≥ 1) = 0.04 (inefficacy or harm)

The Shocking Truth of Bayes

It controls the reliability of evidence at the decision point
Not the pre-study tendency for data extremes under an unknowable assumption
Simulation examples: bit.ly/bayesOp

How Does a Bayesian Sequential Design Gain Power?

Not having an α penalty
Being directional (no penalty for possibility of making a claim for an increase in mortality)
Allowing for infinitely many data looks
Borrowing information when there is treatment effect modification (interaction)

Outcome Variable

Why Do Pivotal Cardiovascular Trials Need 6000-10000 Pts?

Because they use low information outcome: time to binary event
Do not distinguish a small MI from death and completely ignore death after a nonfatal MI
Need 462 events to estimate a hazard ratio to within a factor of 1.20 (from 0.95 CI)
Need 384 patients to estimate a difference in means to within 0.2 SD (n = 96 for Xover design)
Event incidence ⇓ (censoring ⇑) ⇒ power for time to event = power of binary outcome
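
The two precision figures above can be reproduced with standard large-sample approximations. This is a sketch under stated assumptions: 1:1 allocation with var(log HR) ≈ 4/events, equal group sizes for the difference in means, and, for the crossover case, within-patient differences whose SD equals the SD unit used for the margin.

```python
from math import log
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # multiplier for a 0.95 confidence interval

def events_for_hr_fold(fold=1.20, z=Z):
    """Events needed so the 0.95 CI for the HR spans estimate/fold to estimate*fold."""
    # Solve z * sqrt(4 / d) = log(fold) for d
    return 4 * (z / log(fold)) ** 2

def total_n_for_mean_diff(margin_sd=0.2, z=Z):
    """Total patients (two equal parallel groups) for a margin of error in SD units."""
    # SE of a difference in means with n/2 per group is sqrt(4 / n) SDs
    return 4 * (z / margin_sd) ** 2

def n_for_crossover(margin_sd=0.2, z=Z):
    """Patients in a crossover design, assuming SD of within-patient differences = 1 SD unit."""
    return (z / margin_sd) ** 2

print(round(events_for_hr_fold()))      # events for HR within a factor of 1.20
print(round(total_n_for_mean_diff()))   # total patients, parallel-group design
print(round(n_for_crossover()))         # patients, crossover design
```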

General Outcome Attributes

Timing and severity of outcomes
Handle
- terminal events (death)
- non-terminal events (MI, stroke)
- recurrent events (hospitalization)
Break the ties; the more levels of Y the better
fharrell.com/post/ordinal-info
Maximum power when there is only one patient at each level (continuous Y)

What is a Fundamental Outcome Assessment?

In a given week or day what is the severity of the worst thing that happened to the patient?
Expert clinician consensus of outcome ranks
Spacing of outcome categories irrelevant
Avoids defining additive weights for multiple events in the same week
Events can be graded & can code common co-occurring events as worse event

Fundamental Outcome, continued

Can translate an ordinal longitudinal model to obtain a variety of estimates
- time until a condition
- expected time in state
- probability of something bad or worse happening to the pt over time, by treatment
Bayesian partial proportional odds model can compute the probability that the treatment affects mortality differently than it affects nonfatal outcomes

Fundamental Outcome, continued

Ordinal longitudinal model also elegantly handles partial information: at each day/week the ordinal Y can be left, right, or interval censored when a range of the scale was not measured

Examples of Longitudinal Ordinal Outcomes

0=alive 1=dead
- censored at 3w: 000
- death at 2w: 01
- longitudinal binary logistic model OR ≅ HR
0=at home 1=hospitalized 2=MI 3=dead
- hospitalized at 3w, rehosp at 7w, MI at 8w & stays in hosp, f/u ends at 10w: 0010001211
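
A small sketch of how such state strings could be generated from an event list. The function name and its arguments are hypothetical helpers for illustration, and single-digit state codes are assumed.

```python
def encode_path(n_periods, events=None, default_state=0, absorbing_state=None):
    """Return one ordinal state code per period (week) as a string.

    events maps a 1-based period number to the worst state observed in it;
    periods without an entry get default_state. The path ends at the first
    occurrence of absorbing_state (e.g. death).
    """
    events = events or {}
    codes = []
    for period in range(1, n_periods + 1):
        state = events.get(period, default_state)
        codes.append(str(state))
        if state == absorbing_state:
            break
    return "".join(codes)

# 0=alive 1=dead
censored_3w = encode_path(3, absorbing_state=1)
death_2w = encode_path(10, {2: 1}, absorbing_state=1)

# 0=at home 1=hospitalized 2=MI 3=dead
multi_state = encode_path(10, {3: 1, 7: 1, 8: 2, 9: 1, 10: 1}, absorbing_state=3)

print(censored_3w, death_2w, multi_state)
```

Each character is one row of the longitudinal dataset for that patient, so censoring simply means the patient contributes fewer rows.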

Examples, continued

0-6 QOL excellent–poor, 7=MI 8=stroke 9=dead
- QOL varies, not assessed in 3w but pt event free, stroke at 8w, death 9w: 12[0-6]334589
- MI status unknown at 7w: 12[0-6]334[5,7]89
Can make first 200 levels be a continuous response variable and the remaining values represent clinical event overrides

Statistical Model

Proportional odds ordinal logistic model with covariate adjustment
Handles intra-patient correlation with a Markov process or other longitudinal models
Extension of binary logistic model
Generalization of Wilcoxon-Mann-Whitney Two-Sample Test
No assumption about Y distribution for a given patient type
Does not use the numeric Y codes

Interpretation

B:A odds ratio
P(B > A)
- c-index; concordance probability ≅ OR^0.65/(1 + OR^0.65)
- fharrell.com/post/po – does not assume proportional odds!
Probability that Y=y or worse as a function of time and treatment
- does assume PO but the partial PO model relaxes this
Time gained in good states (difference in mean time unwell)
Bayesian partial PO model: compute posterior P(treatment affects death differently)
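
The odds-ratio-to-concordance approximation quoted above is easy to evaluate directly. A minimal sketch, using the exponent exactly as stated on this slide; the function name is a hypothetical helper, and the result is an approximation rather than an exact identity.

```python
def concordance_from_or(odds_ratio, exponent=0.65):
    """Approximate P(B > A) from a B:A odds ratio: OR^e / (1 + OR^e)."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    t = odds_ratio ** exponent
    return t / (1 + t)

for odds_ratio in (0.5, 1.0, 1.5, 2.0):
    print(f"OR = {odds_ratio:.1f}: concordance ~ {concordance_from_or(odds_ratio):.3f}")
```

An odds ratio of 1 maps to a concordance of 0.5, and reciprocal odds ratios map to complementary probabilities.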

Examples of Power Gain

VIOLET (Petal Network, NEJM 381:2529, 2019)
Early high-dose vitamin D₃ for 1360 D₃-deficient critically ill adults
Primary endpoint: mortality (slight evidence for increase with D₃)
Ordinal endpoint collected each day for 28 consecutive days

VIOLET Day 1 - 14 Ordinal Outcomes

Relative Efficiency: Single Day vs. More

Statistical Power from Using More Raw Data

Simulation of VIOLET-like studies

Method	Power
Time-to-recovery analysis with Cox model	0.79
Wilcoxon test of vent/ARDS-free days with death=-1	0.32
Longitudinal ordinal model	0.94

Details at hbiostat.org/R/Hmisc/markov/sim.html

General Principles for Choosing Outcome Measures

Increase power by breaking ties
Get close to the raw data
Relevant to patients
Respect timing and severity of outcomes
Can do automatic risk/benefit trade-offs by including safety events in an ordinal outcome scale

Covariate Adjustment

Has almost nothing to do with baseline balance across treatments
Has to do with outcome heterogeneity within a treatment group
Adjustment for strong baseline prognostic factors increases Bayesian and frequentist power for free
Does this by getting the outcome model more correct
Example: older patients die sooner; pts with poor baseline 6m walk test have poor post-treatment 6m walk test

Covariate Adjustment, continued

Provides estimates of efficacy for individual patients by addressing the fundamental clinical question:
- If I compared two patients who have the same baseline variables but were given different treatments, by how much better should I expect the outcome to be with treatment B instead of treatment A?

Covariate Adjustment, continued

Provides a correct basis for analysis of heterogeneity of treatment effect
- subgroup analysis is very misleading and does not inherit proper covariate adjustment
- subgroup analysis does not properly handle continuous variables
Continuous Y with X explaining 1/2 of the variation in Y
- sample size cut by a factor of 1/2 in comparison to unadjusted analysis
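
The arithmetic behind this can be sketched as follows, assuming a linear model for continuous Y: adjustment leaves residual variance (1 - R²)σ², and required sample size is proportional to that variance. The sketch ignores the small cost of estimating the covariate effects, and the 384-patient starting point is used only as an example input.

```python
def adjusted_n(n_unadjusted, r_squared):
    """Sample size giving the same precision after adjusting for covariates
    that explain a fraction r_squared of the variation in Y."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return n_unadjusted * (1 - r_squared)

for r2 in (0.0, 0.25, 0.5):
    print(f"R^2 = {r2:.2f}: n = {adjusted_n(384, r2):.0f}")
```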

Don’t Compute Change from Baseline!

Change from baseline (post - pre) assumes
- post is linearly related to pre
- slope of post on pre is 1.0
ANCOVA assumes neither so is more efficient and extends to ordinal outcomes
post - pre is inconsistent with a parallel-group design
post - pre is manipulated by pt inclusion criteria and regression to the mean
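
A simulation sketch of the efficiency argument, with hypothetical settings that are not from the lecture: the true slope of post on pre is 0.5 rather than 1.0, var(pre) = var(post) = 1, 50 patients per arm, and a true treatment effect of 0.3. ANCOVA is computed here as the difference in post means adjusted by the pooled within-arm slope.

```python
import random
from statistics import mean, pvariance

def simulate_estimators(n_per_arm=50, slope=0.5, effect=0.3, n_sims=2000, seed=2):
    """Return treatment effect estimates from change scores and from ANCOVA."""
    rng = random.Random(seed)
    resid_sd = (1 - slope ** 2) ** 0.5  # keeps var(post) = 1 when var(pre) = 1
    change_est, ancova_est = [], []
    for _ in range(n_sims):
        arms = []
        for trt in (0, 1):
            pre = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            post = [slope * x + effect * trt + rng.gauss(0, resid_sd) for x in pre]
            arms.append((pre, post))
        (xa, ya), (xb, yb) = arms
        # Change-score estimator: difference between arms in mean (post - pre)
        change_est.append((mean(yb) - mean(xb)) - (mean(ya) - mean(xa)))
        # ANCOVA estimator: estimate the slope from pooled within-arm variation
        sxy = sxx = 0.0
        for x, y in arms:
            mx, my = mean(x), mean(y)
            sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            sxx += sum((xi - mx) ** 2 for xi in x)
        b = sxy / sxx
        ancova_est.append((mean(yb) - mean(ya)) - b * (mean(xb) - mean(xa)))
    return change_est, ancova_est

change_est, ancova_est = simulate_estimators()
print(f"variance of change-score estimate: {pvariance(change_est):.4f}")
print(f"variance of ANCOVA estimate:       {pvariance(ancova_est):.4f}")
```

Both estimators target the same treatment effect under randomization; the change score simply forces the slope to 1.0, which adds noise whenever the true slope differs.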

Take Home Messages

Don’t take sample sizes seriously; consider sequential designs with unlimited data looks and study extension
α is not a relevant quantity to “control” or “spend” (unrelated to decision error)
Choose high-resolution high-information Y
Longitudinal ordinal Y is a general and flexible way to capture severity and timing of outcomes
Always adjust for strong baseline prognostic factors; don’t stratify by treatment in Table 1

More Information and To Continue the Discussion

datamethods.org
fharrell.com
hbiostat.org
hbiostat.org/proj/covid19
hbiostat.org/bbr/md/alpha
hbiostat.org/bayes/bet/design ←
fharrell.com/post/nfl