4  Multiplicity

[Diagram: Multiplicity arises from subgroups, endpoints, and time, creating opportunities for cherry picking that are best handled by pre-study planning. Against cherry picking, the frequentist applies a very conservative adjustment while the Bayesian uses appropriately skeptical priors. With transparency of priorities and pre-study planning that removes the opportunity for cherry picking, the frequentist still makes a subjective, non-data-dependent adjustment (A vs B is penalized for also comparing A vs C, although a strict pre-specified reporting order should imply less need for adjustment), whereas the Bayesian again uses appropriately skeptical priors: there is no penalty just for answering more than one question, final evidence comes only from the data and prior information, and evidence for A vs B comes strictly from the A vs B data and the prior.]

In the Bayesian paradigm evidence for an effect comes from the data and from pre-study information. Outside of hierarchical models for disjoint patient subsets, evidence is not discounted because a different effect was also assessed in the same study. Skepticism about a specific effect is encoded in the prior, not in a somewhat arbitrary label (e.g., co-primary endpoint) that is given to a comparison. Cherry picking only occurs when pre-study due diligence is not given to prior specification.

Frequentist α-spending is driven by results that might have been obtained and completely ignores the results that were actually obtained, since the computation of α does not involve any data. α-spending is therefore not consistent with the rules of evidence.

4.1 Background

Before getting into statistical issues, consider various types of multiple statistical assessments made from one dataset.

  • assessing effect of treatment separately on multiple patient types (subgroups)
  • separate effects on different clinical endpoints
  • sequential assessments as the dataset matures, on the overall clinical trial sample and for one endpoint

Then there are the broadest classes of multiplicity:

  • cherry picking where one looks at various subgroups and endpoints and chooses the one to report based on which combination shows the largest treatment effect in the right direction
  • pre-planned multiple assessments where all results are to be reported

It is generally agreed that the first type, cherry picking, deserves the most severe accounting for multiple opportunities. As discussed below, the Bayesian approach to this is very simple with adequate planning, but fails without it. For the second type, Cook & Farewell (1996) argue that as long as cherry picking is disallowed and transparency is enforced, one should be able to ask multiple pre-planned questions and get separate answers without a multiplicity correction.

In the very special case where multiplicity arises solely from assessing a treatment effect in multiple disjoint subgroups, it is better, whether using a frequentist or a Bayesian approach, to “get the model right” rather than to engage in arbitrary after-the-fact multiplicity adjustments. This is done with hierarchical modeling, e.g., a Bayesian or frequentist random intercepts model whereby interaction terms involving treatment are shrunken toward what is likely to replicate in a future study.
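To make the shrinkage idea concrete, here is a minimal sketch, not code from this chapter, of empirical-Bayes shrinkage of subgroup-specific treatment effects toward the overall effect. The effect estimates, standard errors, and between-subgroup variance τ² are hypothetical assumptions; a full Bayesian hierarchical model would place a prior on τ rather than fixing it.

```python
# Minimal sketch: empirical-Bayes shrinkage of subgroup-specific treatment
# effects toward the overall effect, the idea behind a random-effects
# hierarchical model for treatment-by-subgroup interactions.
import numpy as np

# Hypothetical observed treatment effects and their standard errors
# in four disjoint subgroups (all numbers are assumptions for illustration)
effects = np.array([0.10, 0.45, -0.05, 0.30])
ses     = np.array([0.15, 0.20,  0.18, 0.25])

overall = np.average(effects, weights=1 / ses**2)   # precision-weighted overall effect
tau2    = 0.02   # assumed between-subgroup variance of the true interaction effects

# Shrinkage factor per subgroup: how much of the observed deviation from the
# overall effect is retained; small tau2 or noisy estimates -> heavy shrinkage
shrink   = tau2 / (tau2 + ses**2)
shrunken = overall + shrink * (effects - overall)

for raw, s in zip(effects, shrunken):
    print(f"observed {raw:+.2f} -> shrunken {s:+.2f}")
```

The subgroup-specific deviations are pulled toward the overall effect, which is what makes them more likely to replicate in a future study.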

This approach is very often used inappropriately when one or more continuous baseline variables are involved. For example, if one of the variables is age, analysts frequently dichotomize age (e.g., at 65y) and pretend that the differential treatment effect is discontinuous at age 65 and that the treatment effect is constant within age < 65 and within age \(\geq\) 65. Hierarchical models should never be used when subgroups are artificial. Like main effects, continuous baseline variables, in the rare cases where they do interact with treatment, do so in a smooth dose-response manner, as sketched below.
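As a hypothetical illustration (the data, variable names, and spline settings are assumptions, not from the text), the sketch below fits a smooth age-by-treatment interaction with a regression spline rather than splitting age at 65, so the estimated treatment effect is allowed to vary continuously with age.

```python
# Minimal sketch: model a smooth age-by-treatment interaction with a spline
# instead of dichotomizing age at an arbitrary cutpoint.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age":   rng.uniform(40, 85, n),
    "treat": rng.integers(0, 2, n),
})
# Simulated outcome: treatment benefit increases smoothly with age (an assumption)
df["y"] = 0.02 * df["age"] + df["treat"] * 0.01 * (df["age"] - 40) + rng.normal(0, 1, n)

# bs() is patsy's B-spline basis; crossing it with treat lets the treatment
# effect vary smoothly with age, with no artificial subgroups
fit = smf.ols("y ~ treat * bs(age, df=4)", data=df).fit()
print(fit.params.filter(like="treat"))
```

The coefficients involving treat describe how the treatment effect changes smoothly over the age range.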

Now consider how multiplicity applies and is addressed in the two major schools of statistics.

4.2 Frequentist

  • Clear that if one tests more hypotheses (whether or not they ought to be connected), the chance of making assertions ↑ whether or not there is a true nonzero effect (see the simulation sketch after this list)
  • No statistical principles that lead to unique solutions
  • Problems magnify with adaptive trials or sample size re-estimation
  • Frame of reference (grouping of hypotheses) often unclear
    • does it include other studies?
    • subgroups? endpoints? data looks?
  • Considers sample space, hypotheses that might be tested + those actually tested
    • violates likelihood principle: Under the chosen statistical model, all of the evidence in a sample relevant to model parameters is contained in the likelihood function (Berry (1987))
  • “Paradox of two sponsors”: sponsor 1 designed study for one interim look
    • α=0.047 cutoff at final look to preserve overall type I assertion probability at α=0.05
    • first look inconsequential
    • final analysis: p=0.049; drug not approved
  • Sponsor 2: did not do an interim look, same final data as sponsor 1
    • p=0.049; drug approved
  • Early look discounted for planned later looks
  • Later look discounted for inconsequential early looks
  • Unblinded sample size re-estimation: first wave of data discounted to preserve overall α at end of extended study
  • Consider 4-treatment study: A,B,C,D
    • frequentist assessment of A vs B often discounted because C was compared to D
  • None of these multiplicity adjustments is scientifically satisfactory (Blume (2008))
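The following minimal Monte Carlo sketch (an assumed setup, not from the text) shows where the frequentist pressure to adjust comes from: under a true null effect, testing at 50% information and again at the end, each time with an unadjusted two-sided 0.05 threshold, inflates the overall probability of asserting an effect to roughly 0.08. Restoring 0.05 overall is what forces sponsor 1 to use a stricter final-look cutoff (about 0.047) even when the interim look turns out to be inconsequential.

```python
# Minimal sketch: overall type I assertion probability when a null effect is
# tested at an interim look and at the final look, both at nominal 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
nsim = 200_000

z_half  = rng.standard_normal(nsim)            # z-statistic at 50% information
z_indep = rng.standard_normal(nsim)            # independent increment from second half
z_full  = (z_half + z_indep) / np.sqrt(2)      # z-statistic at the final look

crit = stats.norm.ppf(1 - 0.05 / 2)            # 1.96, two-sided 0.05 threshold
reject_any = (np.abs(z_half) > crit) | (np.abs(z_full) > crit)
print(f"overall type I assertion probability: {reject_any.mean():.3f}")  # ~0.08, not 0.05
```

Tightening the final boundary to keep this at 0.05 is exactly what creates the paradox: identical final data, p = 0.049, and two different regulatory conclusions depending on whether an inconsequential interim look was planned.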

Cook & Farewell (1996), as mentioned above, make a sound argument against arbitrary multiplicity adjustments and their resulting penalization for trying to learn more than one thing from a clinical trial. One can be scientifically accurate and honest by pre-specifying a priority order for the various tests, and presenting p-values in that order regardless of their magnitude.

4.3 Bayesian

  • Posterior probabilities (PPs) well calibrated no matter the type or number of multiplicities present
  • Skepticism focused on effect of interest, not other effects tested
  • Current posterior density accurately reflects study evidence now
  • Obeys likelihood principle
  • The data, and not the context in which the data arose, are what matter for inference
  • Example: inference about prob θ of an event
    • Bayesian inference identical whether one sampled n=20 subjects and found 5 events or enrolled subjects until 5 events occurred, which happened to require n=20 subjects (see the sketch after this list)
    • First design: binomial; second: negative binomial
    • Frequentist: 2 different confidence limits
    • Bayesian: likelihood of the data same: \(\theta^5 (1-\theta)^{15}\)
  • Frequentist significance testing deals with “what would have occurred following results that were not observed at analyses that were never performed” (Emerson (1995))
    • P(test statistic more extreme than the observed value) depends on samples that might have arisen
    • Bayes uses only the sample that has arisen
    • Frequentists limit the sample space (e.g. data look frequency) to limit α: requires more planning and results in less flexibility
  • In comparing treatment A vs B in an A,B,C,D study, Bayes would discount A vs B only because of prior information about how A might compare to B
  • Sequential trial: current PP self-contained, well calibrated, # looks irrelevant
  • Consider example
    • Probabilistic pattern recognition system to identify enemy targets in combat
    • Initial assessment of P(enemy)=0.3 when target is distant
    • Target comes closer, P(enemy)=0.8
    • Closer, air clears, P(enemy)=0.98
    • Irrelevant that P(enemy) was lower a moment ago
    • P(enemy) may decrease while shell is in the air, but P at time of firing was valid
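The binomial vs. negative binomial point above can be sketched in a few lines, assuming for illustration a flat Beta(1, 1) prior: both designs contribute the same likelihood kernel \(\theta^5 (1-\theta)^{15}\), so the Bayesian posterior for θ is identical, whereas frequentist exact confidence limits depend on the design.

```python
# Minimal sketch: the posterior for the event probability depends only on the
# likelihood kernel theta^5 * (1 - theta)^15, not on the stopping rule.
from scipy import stats

events, nonevents = 5, 15
a0, b0 = 1, 1                      # flat Beta(1, 1) prior (an assumption for illustration)

# A binomial design (n = 20 fixed) and a negative binomial design (stop at 5
# events) contribute the same likelihood kernel, so both give the same posterior:
posterior = stats.beta(a0 + events, b0 + nonevents)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))

# Frequentist exact intervals differ between the two designs; e.g., the
# Clopper-Pearson interval for the fixed-n binomial design:
print("binomial exact CI:", stats.binomtest(events, events + nonevents).proportion_ci())
```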

Consider the Bayesian approach to cherry picking. If there is no advance planning, one may be tempted to use default flat priors for each treatment comparison.

  • flat prior \(\rightarrow\) large treatment effects are a priori likely
  • \(\rightarrow\) if a large effect is observed it may be believed
  • computing posterior probabilities of many assertions, with flat priors, leads to a kind of multiplicity problem

Instead, if priors with varying degrees of skepticism are carefully pre-specified, these priors allow safe assessments of multiple endpoints/subgroups/looks. Consider a blood pressure reduction trial with these treatments: placebo, exercise program, a drug, and mental telepathy. If the biggest difference is seen for mental telepathy and that treatment is described as the winner in a press release, we would clearly be engaged in cherry picking and the result is extremely unlikely to replicate. A severe frequentist multiplicity adjustment may be warranted. The Bayesian would solve the problem by having engaged a number of experts before the study. Undoubtedly all of them would believe in the laws of physics and would put forward highly skeptical priors for the telepathy–placebo comparison. This prior would “pull down” the observed treatment effect to tiny values that are much more likely to replicate in other studies, as sketched below.
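A minimal sketch of that “pull down”, with hypothetical numbers: a conjugate normal-normal calculation in which the observed telepathy-vs-placebo blood pressure reduction is taken at face value under a flat prior but shrunk heavily toward zero under the experts’ skeptical prior.

```python
# Minimal sketch (hypothetical numbers): a skeptical prior pulls an observed
# telepathy-vs-placebo effect toward zero, while a flat prior would take the
# observed effect at face value.
import numpy as np

obs_effect, obs_se = 4.0, 2.0        # observed mean SBP reduction (mmHg) and its SE (assumed)
prior_mean, prior_sd = 0.0, 0.5      # experts: telepathy effect almost surely tiny (assumed)

# Conjugate normal-normal posterior: precision-weighted average of prior and data
w_prior = 1 / prior_sd**2
w_data  = 1 / obs_se**2
post_mean = (w_prior * prior_mean + w_data * obs_effect) / (w_prior + w_data)
post_sd   = np.sqrt(1 / (w_prior + w_data))

print(f"flat-prior (face value) estimate: {obs_effect:.2f} mmHg")
print(f"skeptical-prior posterior mean:   {post_mean:.2f} mmHg (SD {post_sd:.2f})")
```

With these assumed numbers the observed 4 mmHg effect is pulled down to roughly 0.2 mmHg, a value far more likely to replicate.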

With Bayes there are no complicated order-dependent closed testing procedures. It is still a good idea to priority-order the various assertions so that when results are reported the context is clear. Other than that, logical rules of evidence as discussed in the last chapter dictate that the veracity of an assertion be independent of the order in which the assertion is examined and depend instead on the data and prior knowledge. Post-study evidence for efficacy of mental telepathy should be based on what is known, and the new data, regardless of when telepathy is analyzed. In this sense one does not need primary, secondary, co-primary, co-secondary, … endpoints but rather needs to be thoughtful about priors and to pre-specify the reporting order.

In the frequentist world, multiplicity comes from the chances you give data to be extreme, not the chances you give true effects to exist.

See the article “Bayesian inference completely solves the multiple comparisons problem” by Andrew Gelman.

4.4 A More Logical Way to Reduce α

A good deal of the time in frequentist analyses the assertion of a treatment effect when the effect is truly zero is triggered by a small observed treatment effect. Forgetting for the moment that α is irrelevant to forward-in-time probabilities of the truth of assertions, one can reduce α by the following procedure. Instead of seeking evidence for a nonzero treatment effect, seek evidence for a non-trivial effect. For example, compute the posterior probability that an odds ratio is below 0.925. Demanding strong evidence (e.g., posterior probability > 0.95) for a more-than-trivial treatment effect will strongly reduce α, which is computed assuming a zero treatment effect.
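A minimal sketch of this idea, assuming a normal approximation to the posterior of the log odds ratio under a nearly flat prior and a hypothetical standard error: requiring posterior probability > 0.95 that the odds ratio is below 0.925, rather than merely below 1, is triggered far less often when the true effect is exactly zero.

```python
# Minimal sketch: how often "strong evidence for a non-trivial effect" is
# asserted when the true odds ratio is exactly 1, compared with the usual
# "any effect" criterion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nsim, se = 100_000, 0.1                 # assumed standard error of the log OR estimate
threshold = np.log(0.925)               # more-than-trivial benefit

# Under a true null (log OR = 0), the estimate is approximately N(0, se^2)
est = rng.normal(0.0, se, nsim)

# Approximate posterior of the log OR is N(est, se^2) under a nearly flat prior
p_nontrivial = stats.norm.cdf((threshold - est) / se)   # P(OR < 0.925 | data)
p_any        = stats.norm.cdf((0.0 - est) / se)         # P(OR < 1     | data)

print("alpha for 'any effect'        :", np.mean(p_any > 0.95))        # about 0.05 (one-sided)
print("alpha for 'non-trivial effect':", np.mean(p_nontrivial > 0.95)) # much smaller
```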

This approach also leads to earlier declaration of futility. Assessment of futility based on requiring a tiny probability of any effect often results in prolonging a study only to find a non-clinically important effect at study end.