Bayesian Thinking

Frank Harrell

Department of Biostatistics
Vanderbilt University School of Medicine
Nashville Tennessee USA

Pogue Lecture
McMaster University

2024-12-06

The Essence of Bayes

Bayes' Rule

  • Respects flow of information & time
  • Data are known; θ is not
  • Makes use of pre-data knowledge about θ
  • Very important, but Bayesian thinking is even more important
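
For reference, Bayes' rule in its standard form for a parameter θ and data y:

$$
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\,p(\theta)}{p(y)} \;\propto\; p(y \mid \theta)\,p(\theta)
$$

The posterior for θ is the prior p(θ) reweighted by the likelihood of the data actually observed, which is why the flow of information runs from known data to the unknown θ.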

Historical Note

  • Thomas Bayes' work was published in 1763 after his death
  • Pierre-Simon Laplace developed a broad interpretation in 1774
  • Majority of applications of statistical inference use Ronald Fisher's approach from 1925
  • Unlike Fisher, Bayes and Laplace did not travel the world preaching their approach, or have hordes of graduate students to advocate it

Historical Note: Illusion of Objectivity

Fisher touted his approach as being objective by

  • sneakily changing the question (does the treatment work?)
  • to a different question that could be answered without computers and without any prior knowledge (if the treatment doesn't work, how surprised should we be by the data?)

Illusion of Objectivity, continued

  • "Without prior knowledge" is actually misleading
  • You must know the full design and intentions of the investigator to be able to compute p
    • In the famous lady tasting tea experiment, Fisher never provided the information needed to compute a p-value!
    • (Think of p from a binomial data distribution vs. from a negative binomial distribution; sketch below)
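
A minimal sketch of the binomial vs. negative binomial point, using illustrative numbers (9 successes in 12 trials) that are not from the lecture:

```python
# Same data -- 9 "successes" in 12 trials -- but two different p-values,
# depending on the experimenter's (usually unstated) stopping intention.
from scipy import stats

# Intention 1: n = 12 trials fixed in advance (binomial sampling).
# One-sided p = P(X >= 9 | p = 0.5)
p_binom = stats.binom.sf(8, 12, 0.5)     # sf(8) = P(X >= 9)

# Intention 2: sample until the 3rd "failure" (negative binomial sampling);
# the 3rd failure happened to occur on trial 12, i.e., after 9 successes.
# One-sided p = P(9 or more successes before the 3rd failure | p = 0.5)
p_nbinom = stats.nbinom.sf(8, 3, 0.5)    # X = # successes before 3rd failure

print(f"binomial sampling:          p = {p_binom:.4f}")   # ~0.073
print(f"negative binomial sampling: p = {p_nbinom:.4f}")  # ~0.033
```

The likelihoods are proportional, so a Bayesian posterior for p is identical under both designs; only the p-value changes.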

J Berger and D Berry, American Scientist, 1988
S Senn, Significance, 2012

Historical Note: Computation

  • For non-toy problems, the computations required were prohibitive until the BUGS project, led by David Spiegelhalter, began in 1989
  • Software for more general models and hardware improvements for faster computation happened rapidly after 2000
  • This corresponds to a rapid uptake of the Bayesian approach
    • especially by those who were not taught that Bayes is bad in graduate school

Recent Landmarks

  • 2012: Stan Bayesian modeling language
  • 2015: brms R package - easy specification for a huge variety of statistical models
  • 2016: revolutionary and highly accessible book Statistical Rethinking by Richard McElreath
  • 2024: many important Bayesian initiatives at FDA and beyond

Typical Presentation of Bayesian Results

S Heuts et al, Canadian J Cardiology 2024

Bayesian vs. Frequentist Statements

  • RCT comparing Tx A and B, observed mean difference 5 mmHg, unknown true mean difference Δ
  • Frequentist: if Δ = 0, p = 0.02 = P(difference as or more extreme than 5 mmHg | Δ = 0)
    • p = degree to which Δ = 0 is embarrassed by the data
    • Uncertainty interval: CI has only long-run properties
  • Bayesian: Tx B probably (0.97) lowers BP: P(Δ > 0 | data) = 0.97, with Δ the true reduction from B
    • Uncertainty interval: highest posterior density or credible interval for Δ
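
A minimal sketch of the kind of calculation behind the Bayesian statement, assuming a normal likelihood and a normal prior for Δ; the standard error and prior below are invented for illustration (they just happen to land near the slide's 0.97):

```python
# Posterior probability that Tx B lowers BP, i.e., P(Delta > 0 | data),
# under a conjugate normal-normal model. se and the prior are assumptions.
from scipy import stats

diff, se = 5.0, 2.5                 # observed mean reduction (B vs A), assumed SE
prior_mean, prior_sd = 0.0, 10.0    # wide, nearly flat prior (assumption)

# Conjugate normal update: precision-weighted average
w = (1 / se**2) / (1 / se**2 + 1 / prior_sd**2)   # weight on the data
post_mean = w * diff + (1 - w) * prior_mean
post_sd = (1 / se**2 + 1 / prior_sd**2) ** -0.5

p_benefit = stats.norm.sf(0, loc=post_mean, scale=post_sd)
print(f"P(Delta > 0 | data) = {p_benefit:.2f}")   # ~0.97 with these inputs
```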

Example: Evidence for a Defect in a Randomizer

  • An EHR randomizer has assigned 130 pts to Tx A and 94 to B
  • Intended 1:1 randomization. Is the randomizer defective?
  • p=0.019, Wilson CI [0.51, 0.64]
  • Smart clinician: "Oh, the chance of getting an imbalance this extreme if randomization is truly 1:1 is 0.019" - No!

Evidence for a Defect, continued

  • p = 0.006 + 0.013
  • And what about clinical vs. statistical significance?
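
A sketch reproducing the frequentist quantities from the last two slides (scipy and statsmodels):

```python
# Frequentist look at the randomizer: 130 of 224 patients assigned to Tx A
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

n_a, n = 130, 224

# Exact two-sided binomial test of H0: P(assignment to A) = 0.5
test = stats.binomtest(n_a, n, 0.5)
print(f"two-sided exact p = {test.pvalue:.3f}")   # ~0.019

# Wilson 95% confidence interval for P(assignment to A)
lo, hi = proportion_confint(n_a, n, alpha=0.05, method="wilson")
print(f"Wilson 95% CI: [{lo:.2f}, {hi:.2f}]")     # ~[0.51, 0.64]
```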

Bayesian Evidence for a Defect

  • We have prior information!
  • θ = P(assignment to Tx A) is unlikely to be far from 0.5, or someone would have noticed
  • Encode this as a prior for θ concentrated around 0.5
  • Posterior: P(non-trivial defect | data) = 0.93, evidence for a non-trivial defect in the randomizer (sketch below)
  • Which is more actionable: 0.93 prob. of a non-trivial defect, or knowing that we get data as or more surprising 0.019 of the time if there is zero defect?
    • Note: "surprising" includes trivial deviations from 1:1
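
A minimal sketch of the Bayesian version, assuming for illustration a Beta(50, 50) prior (most mass in roughly [0.4, 0.6]) and defining a "non-trivial defect" as θ outside [0.475, 0.525]; the lecture's exact prior and defect threshold are at the link on the next slide, so these assumed inputs will not reproduce the 0.93 exactly:

```python
# Posterior evidence for a non-trivial randomizer defect.
# Assumptions (illustrative, not the lecture's exact choices):
#   prior: theta ~ Beta(50, 50)  -> most prior mass inside [0.40, 0.60]
#   non-trivial defect: theta outside [0.475, 0.525]
from scipy import stats

a0, b0 = 50, 50          # Beta prior parameters (assumption)
n_a, n_b = 130, 94       # observed assignments

post = stats.beta(a0 + n_a, b0 + n_b)   # conjugate Beta posterior for theta

p_defect = post.cdf(0.475) + post.sf(0.525)
print(f"P(non-trivial defect | data) = {p_defect:.2f}")  # ~0.87 with this prior
```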

Bayesian Evidence, continued

hbiostat.org/bayes/bet/evidence

Some Remarks About Bayes vs. Frequentist Inference

  • Do you want the probability of a positive result when there is nothing (α)? Or the probability that a positive result turns out to be nothing?
  • Bayes quantifies evidence for every value of θ; frequentist hypothesis tests try to build evidence for what θ is not.

hbiostat.org/bbr/alpha

Remarks, continued

  • Frequentism is about a process: an endless stream of samples; long-term operating characteristics over these samples are emphasized
  • Bayes deals with uncovering hidden truths in the process that generated the current dataset
    • The data generating mechanism also allows inference to population quantities

Remarks, continued

  • The Bayesian approach goes out on a limb in order to answer the original question (is the effect > x?). The frequentist approach stays close to home, not requiring quantification of prior knowledge, to answer an easier but almost irrelevant question (how strange are my data?).
  • Null hypothesis testing is simple because it kicks down the road the gymnastics needed to subjectively convert observations about data to evidence about parameters.

fharrell.com/post/journey

Bayesian and Frequentist Smoke Detectors

  • α: 0.05 chance of an alarm when there is no fire
  • Bayes:
    • Set alarm to trigger when P(fire now | current sensor data) is sufficiently high
  • Frequentist alarm designed as how most research is done:
    • 0.8 chance of detecting an inferno
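
A worked illustration of the difference, using Bayes' rule with an assumed base rate of fires; everything except α = 0.05 and power = 0.8 is invented:

```python
# What does an alarm from the alpha = 0.05 / power = 0.8 detector really mean?
alpha, power = 0.05, 0.8
p_fire = 0.001                      # assumed base rate of fires

# Bayes' rule: P(fire | alarm)
p_alarm = power * p_fire + alpha * (1 - p_fire)
p_fire_given_alarm = power * p_fire / p_alarm
print(f"P(fire | alarm) = {p_fire_given_alarm:.3f}")   # ~0.016

# The Bayesian detector instead sounds only when P(fire | sensor data)
# itself crosses a chosen threshold, so an alarm means what it says.
```

With a rare event, the frequentist alarm is almost always a false alarm despite its good operating characteristics.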

More Remarks

  • The frequentist type I "error" α is the probability of asserting an effect when there is no effect, and is independent of data.
  • Bayesian P(no efficacy | data) = probability the treatment doesn't work whether or not you assert that it does.

Bayesian Thinking

Replication

  • F: Replication is central
  • B: How do you know that an experiment is worth replicating?
    What is the evidence at present?

Sample Size

  • F: Assume everything, even the MCID, is fixed; α is spent against this
  • B: Put uncertainties around everything you don't know, including the MCID
    • Make n a random variable
    • Experiment until you have enough evidence to make a decision
    • Stop much earlier than F for futility

Multiplicity

  • F: Multiplicity comes from chances you give data to be extreme (e.g., sequential tests)
    • And from detaching clinical and statistical significance
  • B: No multiplicity from sequential looks
    • New looks merely make previous looks obsolete
  • F: If team A ultimately ties or loses, P(team A ever ahead by 10+ points) ↑ as # looks ↑
  • B: Look continuously and stop watching the game the instant P(team A ultimately wins) > 0.9; no multiplicity

Football, continued

  • Average P(A wins) at the moment of stopping: 0.913
    1,985 games analyzed; among games where watching stopped because P(A wins) > 0.9, A won 91.3% of the time

fharrell.com/post/nfl
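
A simulation sketch of the same idea in Bayesian-trial form: look after every observation, stop the instant the posterior probability of "A wins" (θ > 0.5) exceeds 0.9, and check the truth at stopping. The prior and setup are invented for illustration; Harrell's actual NFL analysis is at the link above.

```python
# Continuous looks with a stop-when-P(theta > 0.5 | data) > 0.9 rule.
# Truth is drawn from the prior, so posterior probabilities are calibrated.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
a0, b0 = 4, 4                  # Beta prior for theta (illustrative)
n_max, n_sims = 300, 1000
hits = []                      # was theta really > 0.5 when we stopped?

for _ in range(n_sims):
    theta = rng.beta(a0, b0)   # draw the truth from the prior
    wins = 0
    for n in range(1, n_max + 1):
        wins += rng.random() < theta
        # look after every single observation -- no multiplicity adjustment
        if beta.sf(0.5, a0 + wins, b0 + n - wins) > 0.9:
            hits.append(theta > 0.5)
            break

print(f"stopped early in {len(hits)} of {n_sims} sims")
print(f"truth rate at stopping: {np.mean(hits):.3f}")   # ~0.9+, like the 0.913
```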

Asking a Bayesian to Compute α is Like ...

  • telling a patient the specificity of the test he already underwent (P(test - | no disease))
  • asking a poker player winning $10M/year to justify his ranking by how often he places bets in games he didn't win
  • judging a politician by how often he speaks when he's honest instead of how often he lies when he speaks

Effect Estimates at Early Stopping

Frequentist

  • Group sequential testing, α-spending-based stopping boundaries
  • When stopping early for efficacy, the effect estimate is biased high in frequentist terms

Effect Estimates at Early Stopping

Bayesian

  • Stop when posterior probability of efficacy provides sufficient confidence in the result
  • Posterior distribution has the same interpretation at all times and for all stopping rules
  • Posterior mean/median/mode/pseudomedian are pulled back by the right amount
  • More pulling back with earlier looks b/c prior is more influential then

Effect Estimates at Early Stopping, continued

hbiostat.org/bayes/bet/sequential

t-Test

  • F: Agonize over whether normality holds and variances are equal
    Ignore double dipping
  • B: Add parameters for amount of non-normality and for ratio of variances, with priors
    • Uncertainty interval will be properly wider b/c we're not certain re: normality, variances
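
A sketch of such a model in PyMC, in the spirit of Kruschke's BEST ("Bayesian estimation supersedes the t test"); the data and priors are invented, and the lecture's own version at the link below may differ:

```python
# "Bayesian t-test": Student-t likelihood (nu = amount of non-normality)
# with separate group SDs (so the variance ratio is a parameter, not an
# assumption). Data and priors are illustrative.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
y_a = rng.normal(0.0, 1.0, 50)      # made-up group A outcomes
y_b = rng.normal(0.7, 1.5, 50)      # made-up group B outcomes

with pm.Model():
    mu_a = pm.Normal("mu_a", 0, 10)
    mu_b = pm.Normal("mu_b", 0, 10)
    sigma_a = pm.HalfNormal("sigma_a", 5)
    sigma_b = pm.HalfNormal("sigma_b", 5)          # variances need not be equal
    nu = pm.Exponential("nu_minus_1", 1 / 29) + 1  # tail-heaviness parameter
    pm.StudentT("obs_a", nu=nu, mu=mu_a, sigma=sigma_a, observed=y_a)
    pm.StudentT("obs_b", nu=nu, mu=mu_b, sigma=sigma_b, observed=y_b)
    diff = pm.Deterministic("diff", mu_b - mu_a)
    idata = pm.sample(1000, tune=1000, chains=4)

# Direct answer from the posterior draws
p_gt = (idata.posterior["diff"] > 0).mean().item()
print(f"P(mu_b > mu_a | data) = {p_gt:.3f}")
```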

hbiostat.org/bbr/htest

Interactions

  • F: Up-front decision to include or exclude Tx × covariate interaction
  • B: Interaction half-in, half-out of the model
    Prior for the interaction specifies amount of borrowing of Tx effects across pt types

R Simon & L Freedman, Biometrics 1997

Clinical & Public Health Significance

  • F: Try to interpret point estimate & CI if H0 is rejected
    Non-inferiority or interval nulls
  • B: Compute P(effect > x | data) for as many x as desired, from one analysis (sketch below)
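
A sketch of the Bayesian bullet: once you have posterior draws of the effect (simulated stand-ins below), every clinical threshold x is one line of code, all from the same analysis:

```python
# P(effect > x | data) for several x, from one set of posterior draws.
import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(4.7, 2.4, 10_000)   # stand-in posterior draws of the effect

for x in (0, 2, 5):                    # 0 = any effect; 2, 5 = clinical thresholds
    print(f"P(effect > {x} | data) = {(draws > x).mean():.3f}")
```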

Regulatory Evidence for Drugs

  • F: Try to make sense of multiple separate p-values (superiority on one endpoint, non-inferior on another, etc.)
  • B: Write a practice-changing condition, compute P(condition | data), combining criteria such as (sketch after this list)
    • any reduction in mortality (HR < 1.0)
    • reduction, or only a small increase, in mortality (HR < 1.1)
    • reduction in infection death (OR < 1)
    • improvement or small worsening in performance status (OR < 1.2)
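
A sketch of computing such a compound condition from posterior draws. The draws below are simulated, independent stand-ins; real draws would come from one joint model and carry the correlations between endpoints:

```python
# Posterior probability of a practice-changing condition combining endpoints.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
log_hr_mort = rng.normal(-0.15, 0.10, n)   # stand-in: log HR, mortality
log_or_perf = rng.normal(-0.20, 0.15, n)   # stand-in: log OR, performance status

cond = (np.exp(log_hr_mort) < 1.1) & (np.exp(log_or_perf) < 1.2)
print(f"P(condition | data) = {cond.mean():.3f}")
```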

Regulator's Regret

  • Suppose that 5 drugs were approved, with p=0.02, 0.04, 0.01, 0.05, 0.04
  • Regulator wants to know track record, e.g., error rate
  • Can't get that from freq. approach
  • Suppose posterior probs were 0.96, 0.95, 0.98, 0.92, 0.96
  • P(drug doesn't work) = 0.04, 0.05, 0.02, 0.08, 0.04
  • Expected # mistakes: sum of these 5 = 0.23 out of 5 drugs
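
The arithmetic, spelled out:

```python
# Regulator's regret: expected number of approved drugs that don't work,
# directly from the five posterior probabilities on the slide.
post_probs = [0.96, 0.95, 0.98, 0.92, 0.96]
p_wrong = [1 - p for p in post_probs]    # 0.04, 0.05, 0.02, 0.08, 0.04
print(f"expected # ineffective approvals: {sum(p_wrong):.2f} of {len(post_probs)}")
```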

Missing Data

  • F: ad hoc procedure for multiple imputation
    Ad hoc procedure for final parameter estimation
    Complex adjustment of d.f. to get accurate CI coverage
  • B: MI then stack all posterior distribution samples from all completed datasets; uncertainty intervals OK
  • Better B: Joint models for Y and for all sometimes-missing variables
    Missing values are just parameters
    No imputation, get exact model-based inference
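
A sketch of the "stack the posteriors" step, using simulated stand-ins for the per-imputation draws:

```python
# Bayesian MI: fit the model to each completed dataset, then stack the draws.
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for posterior draws of one parameter from M = 5 completed datasets
per_imputation = [rng.normal(1.0 + 0.05 * m, 0.3, 4000) for m in range(5)]

stacked = np.concatenate(per_imputation)   # carries between-imputation spread
print(f"posterior mean {stacked.mean():.2f}, sd {stacked.std():.2f}")
# Uncertainty intervals come straight from `stacked`; no d.f. gymnastics.
```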

hbiostat.org/rmsc/missing

Merging Experimental & Observational Data

  • F: ??
  • B: Convert posterior distribution from obs. data to prior for experimental
    • Discount this prior, e.g. pediatric studies: mix two priors, one from adults, mixing fraction is P(applicability of adult data to kids)
  • Better B: Joint analysis of all raw data sources
    Shared parameters across models, explicit obs. bias parameters
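
A sketch of the mixture-prior idea for the pediatric example; the mixing weight and both component densities are invented for illustration:

```python
# Discounted prior: mix the adult-data posterior with a vague prior,
# weighted by an assumed P(adult data applies to kids).
import numpy as np
from scipy import stats

w = 0.4                          # assumed applicability probability
adult = stats.norm(0.5, 0.15)    # stand-in posterior from adult data
vague = stats.norm(0.0, 2.0)     # weakly informative component

theta = np.linspace(-1, 1, 5)
prior = w * adult.pdf(theta) + (1 - w) * vague.pdf(theta)   # mixture density
for t, d in zip(theta, prior):
    print(f"theta = {t:+.1f}: prior density = {d:.3f}")
```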

fharrell.com/post/hxcontrol

High Dimensional Data

  • F: ridge, lasso, elastic net: computational convenience, limited post-fitting inference with penalized MLE
  • B: choose prior to match a family of expected effects
    Inference with ordinary posterior distributions

High Dimensional Data, continued

avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf

Resource Consumption

  • F: approximations including the δ-method (with poor confidence coverage), asymptotics, sufficient statistics, ancillarity, simulating α, multiplicity adjustment, ...
  • B: model specification, compute time, diagnostics for convergence of posterior samples
    • Convergence for primary parameters ⇒ exact asymmetric posterior distributions for complex nonlinear functions of them
    • If posterior is acceptable at final analysis, it must be acceptable for all interim analyses

Summary

  • Bayes' formula allows inference to be based on a simple multiplication; no complex sampling distributions
  • Bayes' formula is important but Bayesian thinking is the most important facet of the Bayesian approach
  • Bayesian thinking is all about
    • Problem solving
    • Quantifying evidence for things unseen instead of computing probabilities about data given unobservables
    • Answering direct, relevant questions

Summary

We need increased Bayesian training, case studies, realization of the joy of learning new things

  • ↓ training in sampling distributions, asymptotics, δ-method, multiple imputation, bootstrap, one-off solutions, etc.
  • ↑ training in model specification that reflects human/clinical/biological systems
    • Flexible models, adding parameters to ↓ P(bad fit)
    • Joint modeling of multiple outcomes, sometimes-missing variables, historical data, ...

More Information

  • hbiostat.org/bayes
  • fharrell.com/post/journey
