Consilium Scientific     2024-03-14

Background

Can You Experiment Until You Have An Answer?

  • I.e., can you act like a physicist?
  • Fully sequential trials are almost never used, and may be more useful than adaptive trials
  • What keeps a researcher from collecting data until there is sufficient evidence in one direction, or until she runs out of time, money, or patience?

The Answers

  • Traditional frequentist statistics
  • Fixed budgets
    • NIH project budgets vs. portfolio budgets

  • Wouldn’t it be more logical to
    • direct more resources to promising studies?
    • direct resources away from unpromising studies (based on futility analysis)?

What Would You Rather Know?

  • The probability of a positive result when there is nothing?
  • The probability that a positive result turns out to be nothing?

  • How surprising are our data if the treatment has no effect?
  • The probability of all possible levels of efficacy?

What Would You Rather Know? continued

  • If you like the first, you like the frequentist status quo
    • p = P(data more impressive than ours if treatment is ignorable)
  • If you like the second, you have Bayesian tendencies & like posterior probabilities
    • P(unknown effect > c | data, prior information)
    • E.g. P(BP reduction > 0 mmHg),
      P(BP reduction > 5 mmHg),
      P(similarity) = P(change in [-3, 3] mmHg)

Power/Sample Size

What Are the Most Common Outcomes of Clinical Trials?

  • Insufficient patient recruitment
  • Equivocal result
    • At planned sample size result not “significant”
    • Uncertainty intervals too wide to learn anything about efficacy
    • The money was spent

What Are the Most Common Causes of Equivocal Results?

  • Over-optimistic power calculation
    • using an effect size larger than the clinically relevant one
    • if the effect is “only” clinically relevant, the study is likely to miss it
  • Fixed sample size
  • Losing power in the belief that α is something that should be controlled when taking multiple data looks
  • Insensitive outcome measure

Another Common Outcome

  • Little evidence the treatment works
  • Usually the study could have been stopped much sooner

The Problem With Sample Size

  • Frequentist (traditional statistics): need a final sample size to know when α is “spent”
  • Fixed budgeting also requires a maximum sample size
  • Sample size calculations are voodoo
    • arbitrary α, β, and effect to detect (Δ)
    • requires accurate σ or event incidence
  • Physics approach: experiment until you have the answer

Is a Sample Size Calculation Needed?

  • Yes for budget, if fixed
  • No sample size needed at all if using a Bayesian sequential design
  • With Bayes, study extension is trivial and requires no α penalty
    • logical to recruit more patients if result is promising but not definitive
      0.7 < P(benefit) < 0.95
    • analysis of cumulative data after new data added merely supersedes analysis of previous dataset
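
A toy sketch of such a design, assuming a normal outcome with known SD and a conjugate normal prior on the treatment effect; every number here is hypothetical, and a real trial would use a full Bayesian model:

```python
# Toy Bayesian sequential design: conjugate normal model, a look after
# every 25 patients, stop for efficacy or harm; extension needs no alpha
# penalty because each analysis of cumulative data supersedes the last.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
prior_mean, prior_sd = 0.0, 10.0     # vague prior on treatment effect delta
sigma, true_delta = 15.0, 5.0        # known outcome SD; simulated true effect
y = []

for look in range(1, 21):
    y.extend(rng.normal(true_delta, sigma, 25))        # 25 more patients
    n = len(y)
    post_var  = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + n * np.mean(y) / sigma**2)
    p_benefit = norm.sf(0.0, post_mean, np.sqrt(post_var))   # P(delta > 0)
    print(f"look {look:2d}  n = {n:3d}  P(benefit) = {p_benefit:.3f}")
    if p_benefit > 0.95 or p_benefit < 0.05:           # definitive either way
        break
    # if 0.7 < P(benefit) < 0.95 at the planned end: simply recruit more
```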

Is α a Good Thing to Control?

  • α = type I assertion probability = P(p < 0.05 | H0) typically
  • It is not the probability of making an error in acting as if a treatment works
  • It is the probability of making an assertion of efficacy (rejecting H0) when any assertion of efficacy would by definition be wrong (i.e., under H0)
  • α ⇑ as # data looks ⇑ (see the sketch below)
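
A simulation sketch of this inflation, under a hypothetical setup: a known-variance z-test on cumulative data, a look after every 25 observations, and no multiplicity adjustment:

```python
# Sketch: type I assertion probability vs. number of unadjusted data
# looks, simulated under H0 (true effect = 0)
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_looks, per_look = 10_000, 10, 25
first_rejection = np.zeros(n_looks)          # count of first rejection at look k

for _ in range(n_trials):
    y = rng.normal(0.0, 1.0, n_looks * per_look)       # data under H0
    for k in range(1, n_looks + 1):
        n = k * per_look
        z = y[:n].mean() * np.sqrt(n)                  # z-test, known SD = 1
        if abs(z) > 1.96:                              # naive p < 0.05
            first_rejection[k - 1] += 1
            break

cum_alpha = np.cumsum(first_rejection) / n_trials
for k, a in enumerate(cum_alpha, 1):
    print(f"{k:2d} looks: cumulative alpha ≈ {a:.3f}")
```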

Is α a Good Thing to Control? continued

  • α includes the probability of things that might have happened
    • even if we have very strong evidence of efficacy at a given time, our design had the possibility of showing an efficacy signal at other times had efficacy=0
    • Bayesian methods deal with what did happen and not what might have happened

Is α a Good Thing to Control? continued

  • Asking one to compute α for a Bayesian design is like asking a poker player winning > $10M/year to justify his ranking by how seldom he places bets in games he didn’t win.

Is α a Good Thing to Control? continued

  • Controlling α leads to conservatism when there are multiple data looks
    • Bayesian sequential designs: expected sample size at time of stopping study for efficacy/harm/futility ⇓ as # looks ⇑

What Should We Control?

  • The probability of being wrong in acting as if a treatment works
  • This is one minus the Bayesian posterior probability of efficacy (probability of inefficacy or harm)
  • Controlled by the prior distribution (+ data, statistical model, outcome measure, sample size)
  • Example: P(HR < 1 | current data, prior) = 0.96
    ⇒ P(HR ≥ 1) = 0.04 (inefficacy or harm)
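
In practice this quantity comes straight from posterior draws. A minimal sketch, with normal random numbers standing in for real MCMC output of the log hazard ratio:

```python
# Sketch: the controlled quantity computed from posterior draws of the
# log hazard ratio (fake draws standing in for real MCMC output)
import numpy as np

rng = np.random.default_rng(2)
log_hr = rng.normal(-0.25, 0.15, 4000)    # hypothetical posterior draws

p_efficacy = np.mean(log_hr < 0)          # P(HR < 1 | data, prior)
print(f"P(HR < 1)  = {p_efficacy:.3f}")
print(f"P(HR >= 1) = {1 - p_efficacy:.3f}  <- prob. of inefficacy or harm")
```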

The Shocking Truth of Bayes

  • It controls the reliability of evidence at the decision point
  • Not the pre-study tendency for data extremes under an unknowable assumption
  • Simulation examples: bit.ly/bayesOp

How Does a Bayesian Sequential Design Gain Power?

  • Not having an α penalty
  • Being directional (no penalty for possibility of making a claim for an increase in mortality)
  • Allowing for infinitely many data looks
  • Borrowing information when there is treatment effect modification (interaction)

Outcome Variable

Why Do Pivotal Cardiovascular Trials Need 6000-10000 Pts?

  • Because they use a low-information outcome: time until a binary event
  • This fails to distinguish a small MI from death, and completely ignores death occurring after a nonfatal MI
  • Need 462 events to estimate a hazard ratio to within a factor of 1.20 (from 0.95 CI)
  • Need 384 patients to estimate a difference in means to within 0.2 SD (n = 96 for Xover design)
  • Event incidence ⇓ (censoring ⇑) ⇒ power of time-to-event analysis ≈ power of a binary outcome analysis
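
The two precision figures above follow from standard large-sample formulas; a short sketch verifying them, assuming 0.95 intervals and an even split between two arms:

```python
# Sketch verifying the precision statements above
import numpy as np

z = 1.959964                                   # 0.975 normal quantile

# 462 total events: SE(log HR) ≈ sqrt(1/e1 + 1/e2) = sqrt(4/e)
e = 462
print(f"HR known to within a factor of {np.exp(z * np.sqrt(4 / e)):.3f}")  # ≈ 1.20

# 384 patients total: SE(mean difference) = sigma * sqrt(4/n)
n = 384
print(f"Difference in means to within {z * np.sqrt(4 / n):.3f} SD")        # ≈ 0.20
```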

General Outcome Attributes

  • Timing and severity of outcomes
  • Handle
    • terminal events (death)
    • non-terminal events (MI, stroke)
    • recurrent events (hospitalization)
  • Break the ties; the more levels of Y the better
    fharrell.com/post/ordinal-info
  • Maximum power when there is only one patient at each level (continuous Y)

What is a Fundamental Outcome Assessment?

  • In a given week or day what is the severity of the worst thing that happened to the patient?
  • Expert clinician consensus of outcome ranks
  • Spacing of outcome categories irrelevant
  • Avoids defining additive weights for multiple events on same week
  • Events can be graded, and common co-occurring events can be coded as the worse event

Fundamental Outcome, continued

  • Can translate an ordinal longitudinal model to obtain a variety of estimates
    • time until a condition
    • expected time in state
    • probability of something bad or worse happening to the pt over time, by treatment
  • Bayesian partial proportional odds model can compute the probability that the treatment affects mortality differently than it affects nonfatal outcomes

Fundamental Outcome, continued

  • Ordinal longitudinal model also elegantly handles partial information: at each day/week the ordinal Y can be left, right, or interval censored when a range of the scale was not measured

Examples of Longitudinal Ordinal Outcomes

  • 0=alive 1=dead
    • censored at 3w: 000
    • death at 2w: 01
    • longitudinal binary logistic model OR ≅ HR
  • 0=at home 1=hospitalized 2=MI 3=dead
    • hospitalized at 3w, rehosp at 7w, MI at 8w & stays in hosp, f/u ends at 10w: 0010001211

Examples, continued

  • 0-6 QOL excellent–poor, 7=MI 8=stroke 9=dead
    • QOL varies, not assessed in 3w but pt event free, stroke at 8w, death 9w: 12[0-6]334589
    • MI status unknown at 7w: 12[0-6]334[5,7]89
  • Can make first 200 levels be a continuous response variable and the remaining values represent clinical event overrides
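
For illustration, a sketch of how the 0-3 scale example above might be laid out as long-format records for a first-order Markov longitudinal model; the `id`, `week`, `y`, and `y_prev` names are hypothetical:

```python
# Sketch: the 0-3 scale example (0=home 1=hosp 2=MI 3=dead) in long format,
# with the previous state as the Markov covariate
import pandas as pd

weekly = [0, 0, 1, 0, 0, 0, 1, 2, 1, 1]   # hosp 3w, rehosp 7w, MI 8w, ...
records = [{"id": 1, "week": w, "y": y,
            # assumed baseline state 0 (at home) before week 1
            "y_prev": weekly[w - 2] if w > 1 else 0}
           for w, y in enumerate(weekly, start=1)]
print(pd.DataFrame(records))
# an interval-censored week (e.g. "[0-6]") would carry lower/upper bounds
# instead of a single y value
```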

Statistical Model

  • Proportional odds ordinal logistic model with covariate adjustment
  • Handles intra-patient correlation with a Markov process or other longitudinal models
  • Extension of binary logistic model
  • Generalization of Wilcoxon-Mann-Whitney Two-Sample Test
  • No assumption about Y distribution for a given patient type
  • Does not use the numeric Y codes
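
A sketch of such a fit on simulated data, using statsmodels' OrderedModel as a frequentist stand-in for the Bayesian (partial) proportional odds models discussed here; `trt` and `age` are made-up covariates:

```python
# Sketch: covariate-adjusted proportional odds model on simulated data
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 400
X = pd.DataFrame({"trt": rng.integers(0, 2, n),
                  "age": rng.normal(60, 10, n)})
# latent logistic variable: treatment log OR = -0.5, age effect = 0.03
latent = -0.5 * X["trt"] + 0.03 * X["age"] + rng.logistic(size=n)
y = pd.cut(latent, bins=[-np.inf, 1.0, 1.8, 2.6, np.inf])  # 4 ordered levels

fit = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())   # exp(trt coefficient) ≈ B:A odds ratio for higher Y
```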

Interpretation

  • B:A odds ratio
  • P(B > A)
    c-index; concordance probability ≅ \(\mathrm{OR}^{0.65}/(1 + \mathrm{OR}^{0.65})\)
    fharrell.com/post/po – does not assume proportional odds!
  • Probability that Y=y or worse as a function of time and treatment
    does assume PO but the partial PO model relaxes this
  • Time gained in good states (difference in mean time unwell)
  • Bayesian partial PO model: compute posterior P(treatment affects death differently)
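
A small sketch of the concordance approximation above:

```python
# Sketch: concordance probability P(B > A) from a proportional odds ratio,
# using the approximation c ≈ OR^0.65 / (1 + OR^0.65)
def concordance(odds_ratio: float) -> float:
    return odds_ratio**0.65 / (1.0 + odds_ratio**0.65)

for odds_ratio in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(f"OR = {odds_ratio:3.1f}  ->  P(B > A) ≈ {concordance(odds_ratio):.3f}")
```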

Examples of Power Gain

  • VIOLET (PETAL Network, NEJM 381:2529, 2019)
  • Early high-dose vitamin D\(_3\) for 1360 D\(_3\)-deficient critically ill adults
  • Primary endpoint: mortality (slight evidence for increase with D\(_3\))
  • Ordinal endpoint collected each day for 28 consecutive days

VIOLET Day 1 - 14 Ordinal Outcomes

  [figure]

Relative Efficiency: Single Day vs. More

  [figure]

Statistical Power from Using More Raw Data

  • Simulation of VIOLET-like studies
      Method                                               Power
      --------------------------------------------------   -----
      Time-to-recovery analysis with Cox model               0.79
      Wilcoxon test of vent/ARDS-free days with death=-1     0.32
      Longitudinal ordinal model                             0.94

Details at hbiostat.org/R/Hmisc/markov/sim.html

General Principles for Choosing Outcome Measures

  • Increase power by breaking ties
  • Get close to the raw data
  • Relevant to patients
  • Respect timing and severity of outcomes
  • Can do automatic risk/benefit trade-offs by including safety events in an ordinal outcome scale

Covariate Adjustment

  • Has almost nothing to do with baseline balance across treatments
  • Has to do with outcome heterogeneity within a treatment group
  • Adjustment for strong baseline prognostic factors increases Bayesian and frequentist power for free
  • Does this by getting the outcome model more correct
    Example: older patients die sooner; pts with a poor baseline 6-minute walk test have a poor post-treatment 6-minute walk test

Covariate Adjustment, continued

  • Provides estimates of efficacy for individual patients by addressing the fundamental clinical question:
    • If I compared two patients who have the same baseline variables but were given different treatments, by how much better should I expect the outcome to be with treatment B instead of treatment A?

Covariate Adjustment, continued

  • Provides a correct basis for analysis of heterogeneity of treatment effect
    • subgroup analysis is very misleading and does not inherit proper covariate adjustment
    • subgroup analysis does not properly handle continuous variables
  • Continuous Y with X explaining \(\frac{1}{2}\) of the variation in Y
    • required sample size is cut in half \((\times \frac{1}{2})\) compared to the unadjusted analysis; see the sketch below
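
A short sketch of the arithmetic, using the standard two-arm normal sample-size formula; the α = 0.05 (two-sided), power = 0.80, and 0.3 SD effect size are arbitrary choices:

```python
# Sketch: explaining half the outcome variance halves the required n,
# because required n scales with the residual variance (1 - R^2) * Var(Y)
z_alpha, z_beta = 1.959964, 0.841621     # alpha = 0.05 two-sided, power = 0.80
delta_sd = 0.3                            # assumed effect, in SD-of-Y units

def n_per_arm(residual_var: float) -> float:
    return 2 * (z_alpha + z_beta)**2 * residual_var / delta_sd**2

print("unadjusted:           n/arm ≈", round(n_per_arm(1.0)))
print("adjusted (R² = 0.5):  n/arm ≈", round(n_per_arm(0.5)))   # about half
```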

Don’t Compute Change from Baseline!

  • Change from baseline (post - pre) assumes
    • post is linearly related to pre
    • slope of post on pre is 1.0
  • ANCOVA assumes neither so is more efficient and extends to ordinal outcomes
  • post - pre is inconsistent with a parallel-group design
  • post - pre is manipulated by pt inclusion criteria and regression to the mean
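
A simulation sketch contrasting the two analyses, with a true post-on-pre slope of 0.5 (violating the change score's implicit slope of 1); all numbers are hypothetical:

```python
# Sketch: change-from-baseline vs. ANCOVA on simulated parallel-group data
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
trt  = rng.integers(0, 2, n).astype(float)
pre  = rng.normal(0.0, 1.0, n)
post = 0.5 * pre + 0.4 * trt + rng.normal(0.0, 1.0, n)   # true effect 0.4

chg = sm.OLS(post - pre, sm.add_constant(trt)).fit()                    # change score
anc = sm.OLS(post, sm.add_constant(np.column_stack([trt, pre]))).fit()  # ANCOVA

print(f"change score: effect {chg.params[1]:.3f}  SE {chg.bse[1]:.3f}")
print(f"ANCOVA:       effect {anc.params[1]:.3f}  SE {anc.bse[1]:.3f}")
```

ANCOVA estimates the post-on-pre slope instead of assuming it is 1.0, so its standard error is smaller here.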

Take Home Messages

  • Don’t take sample sizes seriously; consider sequential designs with unlimited data looks and study extension
  • α is not a relevant quantity to “control” or “spend” (unrelated to decision error)
  • Choose high-resolution high-information Y
  • Longitudinal ordinal Y is a general and flexible way to capture severity and timing of outcomes
  • Always adjust for strong baseline prognostic factors; don’t stratify by treatment in Table 1

More Information and To Continue the Discussion

  • datamethods.org
  • fharrell.com
  • hbiostat.org
  • hbiostat.org/proj/covid19
  • hbiostat.org/bbr/md/alpha
  • hbiostat.org/bayes/bet/design \(\leftarrow\)
  • fharrell.com/post/nfl