Vanderbilt Translational Research Forum
Edge for Scholars     2021-11-04

## Can You Experiment Until You Have An Answer?

• I.e., can you act like a physicist?
• Fully sequential trials are almost never used, and may be more useful than adaptive trials
• What keeps a researcher from collecting data until there is sufficient evidence in one direction, or until she runs out of time, money, or patience?

• Fixed budgets
• NIH project budgets vs. portfolio budgets

• Wouldn’t it be more logical to
• direct more resources to promising studies?
• direct resources away from unpromising studies (based on futility analysis)?

## What Would You Rather Know?

• How surprising is our data if the treatment has no effect?
• The probability of all possible levels of efficacy?

## What Would You Rather Know? continued

• If you like the first, you like the frequentist status quo
• p = P(data more impressive than ours if treatment is ignorable)
• If you like the second, you have Bayesian tendencies & like posterior probabilities
• P(unknown effect > c | data, prior information)
• E.g. P(BP reduction > 0 mmHg),
P(BP reduction > 5 mmHg),
P(similarity) = P(reduction between -3 and 3 mmHg)
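A minimal sketch of how such posterior probabilities could be computed, assuming a normal prior and a normal likelihood for the mean BP reduction. The prior, the observed estimate, and its standard error below are hypothetical numbers chosen for illustration; they do not come from the talk.

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, est, se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior = 1 / prior_sd**2
    w_data = 1 / se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * est)
    return NormalDist(post_mean, post_var**0.5)

# Hypothetical inputs: observed mean reduction 4 mmHg (SE 2), prior N(0, 10^2)
post = normal_posterior(prior_mean=0.0, prior_sd=10.0, est=4.0, se=2.0)

p_any = 1 - post.cdf(0)                  # P(BP reduction > 0 mmHg)
p_gt5 = 1 - post.cdf(5)                  # P(BP reduction > 5 mmHg)
p_similar = post.cdf(3) - post.cdf(-3)   # P(reduction between -3 and 3 mmHg)

print(f"P(>0)={p_any:.3f}  P(>5)={p_gt5:.3f}  P(similarity)={p_similar:.3f}")
```

All three probabilities come from the same posterior distribution, so any other cutoff c can be evaluated without a new analysis.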

## What Are the Most Common Outcomes of Clinical Trials?

• Insufficient patient recruitment
• Equivocal result
• At planned sample size result not “significant”
• Uncertainty intervals too wide to learn anything about efficacy
• The money was spent

## What Are the Most Common Causes of Equivocal Results?

• Over-optimistic power calculation
• using an effect size > clinically relevant
• if effect is “only” clinically relevant, likely to miss it
• Fixed sample size
• Losing power in the belief that α is something that should be controlled when taking multiple data looks
• Insensitive outcome measure

## The Problem With Sample Size

• Frequentist (traditional statistics): need a final sample size to know when α is “spent”
• Fixed budgeting also requires a maximum sample size
• Sample size calculations are voodoo
• arbitrary α, β, effect to detect Δ
• requires accurate σ or event incidence
• Physics approach: experiment until you have the answer

## Is a Sample Size Calculation Needed?

• Yes for budget, if fixed
• No sample size needed at all if using a Bayesian sequential design
• With Bayes, study extension is trivial and requires no α penalty
• logical to recruit more patients if result is promising but not definitive
0.7 < P(benefit) < 0.95
• analysis of cumulative data after new data added merely supersedes analysis of previous dataset
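The idea of extending a study and simply re-analyzing the cumulative data can be sketched as follows. This is an illustration only: it assumes each observation is a within-patient treatment effect with known SD 1, a N(0, 1) prior, and stopping thresholds of 0.95 and 0.05. All of these, and the names `posterior_prob_benefit` and `run_trial`, are assumptions made here, not part of the talk.

```python
import random
from statistics import NormalDist

def posterior_prob_benefit(diffs, sd=1.0, prior_mean=0.0, prior_sd=1.0):
    """P(mean effect > 0 | data, prior) for a normal model with known sd."""
    n = len(diffs)
    prec = 1 / prior_sd**2 + n / sd**2
    mean = (prior_mean / prior_sd**2 + sum(diffs) / sd**2) / prec
    return 1 - NormalDist(mean, prec**-0.5).cdf(0)

def run_trial(true_effect, batch=25, max_batches=40, seed=0):
    """Add patients in batches; each look re-analyzes the cumulative data."""
    rng = random.Random(seed)
    data, history = [], []
    for _ in range(max_batches):          # max_batches stands in for the budget
        data += [rng.gauss(true_effect, 1.0) for _ in range(batch)]
        p = posterior_prob_benefit(data)  # supersedes the previous analysis
        history.append((len(data), p))
        if p >= 0.95 or p <= 0.05:        # assumed evidence thresholds
            break                         # otherwise: extend the study
    return history

history = run_trial(true_effect=0.5)
for n, p in history:
    print(n, round(p, 3))
```

No adjustment is made for the number of looks: the posterior probability at the final look is interpreted the same way regardless of how many earlier looks were taken.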

## Is α a Good Thing to Control?

• α = type I assertion probability = P(p < 0.05 | H0) typically
• It is not the probability of making an error in acting as if a treatment works
• It is the probability of making an assertion of efficacy (rejecting H0) when any assertion of efficacy would by definition be wrong (i.e., under H0)
• α ⇑ as # data looks ⇑
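A small simulation sketch of how α grows with the number of looks when the fixed-sample cutoff is reused at every look. It assumes normal outcomes with known SD 1 and a two-sided z-test; the batch size, number of simulations, and function name are illustrative choices.

```python
import random
from statistics import NormalDist

def type_one_assertion_prob(n_looks, n_per_look=20, n_sim=4000, alpha=0.05, seed=1):
    """Fraction of null trials with |z| above the fixed-sample cutoff at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Sum of n_per_look N(0, 1) outcomes under H0 is N(0, n_per_look)
            total += rng.gauss(0.0, n_per_look ** 0.5)
            n += n_per_look
            if abs(total) / n ** 0.5 > z_crit:
                rejections += 1
                break
    return rejections / n_sim

alpha_1 = type_one_assertion_prob(1)
alpha_10 = type_one_assertion_prob(10)
print(alpha_1, alpha_10)
```

With one look the estimate should sit near the nominal 0.05; with ten looks it is noticeably larger, which is the inflation that frequentist sequential methods pay for with an α penalty.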

## Is α a Good Thing to Control? continued

• α includes the probability of things that might have happened
• even if we have very strong evidence of efficacy at a given time, our design had the possibility of showing an efficacy signal at other times had efficacy=0
• Bayesian methods deal with what did happen and not what might have happened
• Analogy: using α is like judging a gambler by the proportion of games in which she places a bet.
Instead, we judge by the proportion of times she won when she placed a bet

## Is α a Good Thing to Control? continued

• Another analogy: using α is like a trial judge who brags about the low fraction of innocent defendants he convicts; Bayesian probs. are P(current defendant is guilty)
• Controlling α leads to conservatism when there are multiple data looks
• Bayesian sequential designs: expected sample size at time of stopping study for efficacy/harm/futility ⇓ as # looks ⇑

## What Should We Control?

• The probability of being wrong in acting as if a treatment works
• This is one minus the Bayesian posterior probability of efficacy (probability of inefficacy or harm)
• Controlled by the prior distribution (+ data, statistical model, outcome measure, sample size)
• Example: P(HR < 1 | current data, prior) = 0.96
⇒ P(HR ≥ 1) = 0.04 (inefficacy or harm)

## The Shocking Truth of Bayes

• It controls the reliability of evidence at the decision point
• Not the pre-study tendency for data extremes under an unknowable assumption
• Simulation examples: bit.ly/bayesOp

## How Does a Bayesian Sequential Design Gain Power?

• Not having an α penalty
• Being directional (no penalty for possibility of making a claim for an increase in mortality)
• Allowing for infinitely many data looks
• Borrowing information when there is treatment effect modification (interaction)

## Why Do Pivotal Cardiovascular Trials Need 6000-10000 Pts?

• Because they use low information outcome: time to binary event
• Do not distinguish a small MI from death and completely ignore death after a nonfatal MI
• Need 462 events to estimate a hazard ratio to within a factor of 1.20 (from 0.95 CI)
• Need 384 patients to estimate a difference in means to within 0.2 SD (n = 96 for Xover design)
• Event incidence ⇓ (censoring ⇑) ⇒ power for time to event ≈ power of binary outcome
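The 462, 384, and 96 figures can be reproduced with a short calculation. The sketch assumes 1:1 allocation, the usual approximation Var(log HR) ≈ 4 / (total events), and, for the crossover case, that the SD of within-patient differences equals the between-patient SD. These assumptions are supplied here and are not stated on the slide.

```python
from math import log
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # multiplier for a 0.95 interval

def events_for_hr_precision(fold=1.20):
    """Total events so the 0.95 CI for the HR is within a factor `fold`."""
    # Assumes Var(log HR) ~ 4 / events with equal allocation
    return 4 * (z / log(fold)) ** 2

def n_for_mean_diff_precision(margin_sd=0.2):
    """Total patients so the 0.95 CI for a mean difference is +/- margin_sd SDs."""
    # Var(difference) = 4 * sigma^2 / N for N patients split equally
    return 4 * (z / margin_sd) ** 2

def n_crossover(margin_sd=0.2):
    """Patients in a crossover design, assuming SD of paired differences = sigma."""
    return (z / margin_sd) ** 2

print(round(events_for_hr_precision()),
      round(n_for_mean_diff_precision()),
      round(n_crossover()))
```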

## General Outcome Attributes

• Timing and severity of outcomes
• Handle
• terminal events (death)
• non-terminal events (MI, stroke)
• recurrent events (hospitalization)
• Break the ties; the more levels of Y the better
fharrell.com/post/ordinal-info
• Maximum power when there is only one patient at each level (continuous Y)

## What is a Fundamental Outcome Assessment?

• In a given week or day what is the severity of the worst thing that happened to the patient?
• Expert clinician consensus of outcome ranks
• Spacing of outcome categories irrelevant
• Avoids defining additive weights for multiple events in the same week
• Events can be graded & can code common co-occurring events as worse event

## Fundamental Outcome, continued

• Can translate an ordinal longitudinal model to obtain a variety of estimates
• time until a condition
• expected time in state
• probability of something bad or worse happening to the pt over time, by treatment
• Bayesian partial proportional odds model can compute the probability that the treatment affects mortality differently than it affects nonfatal outcomes

## Fundamental Outcome, continued

• Ordinal longitudinal model also elegantly handles partial information: at each day/week the ordinal Y can be left, right, or interval censored when a range of the scale was not measured

## Examples of Longitudinal Ordinal Outcomes

• censored at 3w: 000
• death at 2w: 01
• longitudinal binary logistic model OR ≅ HR
• 0=at home 1=hospitalized 2=MI 3=dead
• hospitalized at 3w, rehosp at 7w, MI at 8w & stays in hosp, f/u ends at 10w: 0010001211
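The coding above can be sketched as a small helper that records the worst state in each week. The function name and the `(first_week, last_week, level)` episode format are hypothetical conventions introduced for this illustration; the sketch handles single-digit levels only.

```python
def encode_weekly_states(n_weeks, episodes, baseline=0):
    """Return one digit per week of follow-up: the worst level seen that week.

    episodes: iterable of (first_week, last_week, level), weeks numbered from 1.
    """
    states = [baseline] * n_weeks
    for first, last, level in episodes:
        for week in range(first, last + 1):
            # Keep the most severe level when episodes overlap in a week
            states[week - 1] = max(states[week - 1], level)
    return "".join(str(s) for s in states)

# 0=at home 1=hospitalized 2=MI 3=dead
episodes = [
    (3, 3, 1),    # hospitalized at 3w
    (7, 7, 1),    # rehospitalized at 7w
    (8, 8, 2),    # MI at 8w
    (9, 10, 1),   # stays in hospital until follow-up ends at 10w
]
print(encode_weekly_states(10, episodes))     # 0010001211
print(encode_weekly_states(3, []))            # censored at 3w: 000
print(encode_weekly_states(2, [(2, 2, 1)]))   # binary coding, death at 2w: 01
```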

## Examples, continued

• 0-6 QOL excellent–poor, 7=MI 8=stroke 9=dead
• QOL varies, not assessed in 3w but pt event free, stroke at 8w, death 9w: 12[0-6]334589
• MI status unknown at 7w: 12[0-6]334[5,7]89
• Can make first 200 levels be a continuous response variable and the remaining values represent clinical event overrides

## Statistical Model

• Proportional odds ordinal logistic model with covariate adjustment
• Handles intra-patient correlation with a Markov process or other longitudinal models
• Extension of binary logistic model
• Generalization of Wilcoxon-Mann-Whitney Two-Sample Test
• No assumption about Y distribution for a given patient type
• Does not use the numeric Y codes

## Interpretation

• B:A odds ratio
• P(B > A)
c-index; concordance probability ≅ $$\mathrm{OR}^{0.66}/(1 + \mathrm{OR}^{0.66})$$
fharrell.com/post/po – does not assume proportional odds!
• Probability that Y=y or worse as a function of time and treatment
does assume PO but the partial PO model relaxes this
• Bayesian partial PO model: compute posterior P(treatment affects death differently)
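The approximate conversion from a B:A odds ratio to the concordance probability P(B > A) is easy to evaluate directly. The function name is illustrative; the 0.66 exponent is the one given above.

```python
def concordance_from_odds_ratio(odds_ratio, exponent=0.66):
    """Approximate P(B > A) from the B:A odds ratio: OR^0.66 / (1 + OR^0.66)."""
    t = odds_ratio ** exponent
    return t / (1 + t)

for odds_ratio in (0.5, 1.0, 2.0):
    print(odds_ratio, round(concordance_from_odds_ratio(odds_ratio), 3))
```

An odds ratio of 1 maps to 0.5 (no tendency for either group to have higher Y), and reciprocal odds ratios map to complementary probabilities.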

## Examples of Power Gain

• VIOLET (Petal Network, NEJM 381:2529, 2019)
• Early high-dose vitamin D$$_3$$ for 1360 D$$_3$$-deficient critically ill adults
• Primary endpoint: mortality (slight evidence for increase with D$$_3$$)
• Ordinal endpoint collected each day for 28 consecutive days

## Statistical Power from Using More Raw Data

• Simulation of VIOLET-like studies

| Method | Power |
|---|---|
| Time-to-recovery analysis with Cox model | 0.79 |
| Wilcoxon test of vent/ARDS-free days with death=-1 | 0.32 |
| Longitudinal ordinal model | 0.94 |

Details at hbiostat.org/R/Hmisc/markov/sim.html

## General Principles for Choosing Outcome Measures

• Increase power by breaking ties
• Get close to the raw data
• Relevant to patients
• Respect timing and severity of outcomes
• Can do automatic risk/benefit trade-offs by including safety events in an ordinal outcome scale

## Covariate Adjustment

• Covariate adjustment has almost nothing to do with baseline balance across treatments
• Has to do with outcome heterogeneity within a treatment group
• Does this by getting the outcome model more correct
Example: older patients die sooner; pts with poor baseline 6m walk test have poor post-treatment 6m walk test

• Provides estimates of efficacy for individual patients by addressing the fundamental clinical question:
• If I compared two patients who have the same baseline variables but were given different treatments, by how much better should I expect the outcome to be with treatment B instead of treatment A?

• Provides a correct basis for analysis of heterogeneity of treatment effect
• subgroup analysis is very misleading and does not inherit proper covariate adjustment
• subgroup analysis does not properly handle continuous variables
• Continuous Y with X explaining $$\frac{1}{2}$$ of the variation in Y
• sample size cut $$\times \frac{1}{2}$$ in comparison to unadjusted analysis
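A rough justification of the halving, assuming a linear model with constant residual variance and a required sample size proportional to that variance. The symbols $$R^2$$ (fraction of variation in Y explained by X), $$\sigma^2$$, and $$N$$ are introduced here for illustration:

$$
\sigma^2_{\text{adj}} = (1 - R^2)\,\sigma^2, \qquad
N_{\text{adj}} \approx (1 - R^2)\,N_{\text{unadj}} = \tfrac{1}{2}\,N_{\text{unadj}} \quad \text{when } R^2 = \tfrac{1}{2}
$$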

## Don’t Compute Change from Baseline!

• Change from baseline (post - pre) assumes
• post is linearly related to pre
• slope of post on pre is 1.0
• ANCOVA assumes neither so is more efficient and extends to ordinal outcomes
• post - pre is inconsistent with a parallel-group design
• post - pre is manipulated by pt inclusion criteria and regression to the mean (RTTM)
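The efficiency argument can be illustrated with the outcome variance each approach leaves unexplained. The sketch assumes pre and post have the same variance σ² and correlation ρ, so that the slope of post on pre is ρ rather than 1.0. These are simplifying assumptions made for this example, and the function name is hypothetical.

```python
def outcome_variances(rho, sigma2=1.0):
    """Variance analyzed by change scores vs. residual variance from ANCOVA."""
    change_score = 2 * sigma2 * (1 - rho)   # Var(post - pre)
    ancova = sigma2 * (1 - rho ** 2)        # Var(post | pre), baseline as covariate
    return change_score, ancova

for rho in (0.2, 0.5, 0.8):
    change_score, ancova = outcome_variances(rho)
    print(rho, round(change_score, 3), round(ancova, 3))
```

Under these assumptions the ANCOVA variance is never larger than the change-score variance (their ratio is (1 + ρ)/2), and for ρ < 0.5 the change score is noisier than analyzing post alone.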

## Take Home Messages

• Don’t take sample sizes seriously; consider sequential designs with unlimited data looks and study extension
• α is not a relevant quantity to “control” or “spend” (unrelated to decision error)
• Choose high-resolution high-information Y
• Longitudinal ordinal Y is a general and flexible way to capture severity and timing of outcomes
• Always adjust for strong baseline prognostic factors; don’t stratify by treatment in Table 1

• datamethods.org
• fharrell.com
• hbiostat.org
• hbiostat.org/proj/covid19
• hbiostat.org/bbr/md/alpha.html