4  Multiplicity

[Diagram: Multiplicity arises from subgroups, endpoints, and time, creating opportunities for cherry picking that are best handled by pre-study planning. Against cherry picking, the frequentist applies a very conservative adjustment while the Bayesian uses appropriately skeptical priors. With transparency of priorities and pre-study planning that removes the opportunity for cherry picking, the frequentist still makes a subjective, non-data-dependent adjustment (A vs B is penalized for also comparing A vs C, although a strict pre-specified reporting order should imply less need for adjustment), whereas the Bayesian again uses appropriately skeptical priors: there is no penalty just for answering more than one question, final evidence comes only from the data and prior information, and evidence for A vs B comes strictly from the A vs B data and the prior.]

In the Bayesian paradigm evidence for an effect comes from the data and from pre-study information. Outside of hierarchical models for disjoint patient subsets, evidence is not discounted because a different effect was also assessed in the same study. Skepticism about a specific effect is encoded in the prior, not in a somewhat arbitrary label (e.g., co-primary endpoint) that is given to a comparison. Cherry picking only occurs when pre-study due diligence is not given to prior specification.

Frequentist α-spending is driven by results that might have been obtained and completely ignores the results that were actually obtained, since the computation of α does not involve any data. α-spending is therefore not consistent with the rules of evidence.

4.1 Background

Before getting into statistical issues, consider various types of multiple statistical assessments made from one dataset.

  • assessing effect of treatment separately on multiple patient types (subgroups)
  • separate effects on different clinical endpoints
  • sequential assessments as the dataset matures, on the overall clinical trial sample and for one endpoint

Then there are the broadest classes of multiplicity:

  • cherry picking where one looks at various subgroups and endpoints and chooses the one to report based on which combination shows the largest treatment effect in the right direction
  • pre-planned multiple assessments where all results are to be reported

It is generally agreed that the first type, cherry picking, deserves the most severe accounting for multiple opportunities. As discussed below, the Bayesian approach to this is very simple with adequate planning, but fails without it. For the second type, Cook & Farewell (1996) argue that as long as cherry picking is disallowed and transparency is enforced, one should be able to ask multiple pre-planned questions and get separate answers without a multiplicity correction.

In the very special case where multiplicity arises solely from assessing a treatment effect in multiple disjoint subgroups, it is better, whether using a frequentist or a Bayesian approach, to “get the model right” rather than to engage in arbitrary after-the-fact multiplicity adjustments. This is done with hierarchical modeling, e.g., a Bayesian or frequentist random intercepts model whereby interaction terms involving treatment are shrunken toward what is likely to replicate in a future study.
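To make the shrinkage idea concrete, here is a minimal sketch, not code from this chapter, of empirical-Bayes shrinkage of subgroup-specific treatment effects toward the overall effect. The effect estimates, standard errors, and between-subgroup variance τ² are hypothetical assumptions; a full Bayesian hierarchical model would place a prior on τ rather than fixing it.

```python
# Minimal sketch: empirical-Bayes shrinkage of subgroup-specific treatment
# effects toward the overall effect, the idea behind a random-effects
# hierarchical model for treatment-by-subgroup interactions.
import numpy as np

# Hypothetical observed treatment effects and their standard errors
# in four disjoint subgroups (all numbers are assumptions for illustration)
effects = np.array([0.10, 0.45, -0.05, 0.30])
ses     = np.array([0.15, 0.20,  0.18, 0.25])

overall = np.average(effects, weights=1 / ses**2)   # precision-weighted overall effect
tau2    = 0.02   # assumed between-subgroup variance of the true interaction effects

# Shrinkage factor per subgroup: how much of the observed deviation from the
# overall effect is retained; small tau2 or noisy estimates -> heavy shrinkage
shrink   = tau2 / (tau2 + ses**2)
shrunken = overall + shrink * (effects - overall)

for raw, s in zip(effects, shrunken):
    print(f"observed {raw:+.2f} -> shrunken {s:+.2f}")
```

The subgroup-specific deviations are pulled toward the overall effect, which is what makes them more likely to replicate in a future study.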

This approach is very often used inappropriately when one or more continuous baseline variables are involved. For example, if one of the variables is age, analysts frequently dichotomize age (e.g., at 65y) and pretend that the differential treatment effect is discontinuous at age 65 and that the treatment effect is constant within age < 65 and within age \(\geq\) 65. Hierarchical models should never be used when subgroups are artificial. Like main effects, continuous baseline variables, in the rare cases where they do interact with treatment, do so in a smooth dose-response manner, as sketched below.
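As a hypothetical illustration (the data, variable names, and spline settings are assumptions, not from the text), the sketch below fits a smooth age-by-treatment interaction with a regression spline rather than splitting age at 65, so the estimated treatment effect is allowed to vary continuously with age.

```python
# Minimal sketch: model a smooth age-by-treatment interaction with a spline
# instead of dichotomizing age at an arbitrary cutpoint.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age":   rng.uniform(40, 85, n),
    "treat": rng.integers(0, 2, n),
})
# Simulated outcome: treatment benefit increases smoothly with age (an assumption)
df["y"] = 0.02 * df["age"] + df["treat"] * 0.01 * (df["age"] - 40) + rng.normal(0, 1, n)

# bs() is patsy's B-spline basis; crossing it with treat lets the treatment
# effect vary smoothly with age, with no artificial subgroups
fit = smf.ols("y ~ treat * bs(age, df=4)", data=df).fit()
print(fit.params.filter(like="treat"))
```

The coefficients involving treat describe how the treatment effect changes smoothly over the age range.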

Now consider how multiplicity applies and is addressed in the two major schools of statistics.

4.2 Frequentist

  • Clear that if one tests more hypotheses (whether or not they ought to be connected), the chance of making assertions ↑ whether or not there is a true nonzero effect (see the simulation sketch after this list)
  • No statistical principles that lead to unique solutions
  • Problems magnify with adaptive trials or sample size re-estimation
  • Frame of reference (grouping of hypotheses) often unclear
    • does it include other studies?
    • subgroups? endpoints? data looks?
  • Considers sample space, hypotheses that might be tested + those actually tested
    • violates likelihood principle: Under the chosen statistical model, all of the evidence in a sample relevant to model parameters is contained in the likelihood function (Berry (1987))
  • “Paradox of two sponsors”: sponsor 1 designed study for one interim look
    • α=0.047 cutoff at final look to preserve overall type I assertion probability at α=0.05
    • first look inconsequential
    • final analysis: p=0.049; drug not approved
  • Sponsor 2: did not do an interim look, same final data as sponsor 1
    • p=0.049; drug approved
  • Early look discounted for planned later looks
  • Later look discounted for inconsequential early looks
  • Unblinded sample size re-estimation: first wave of data discounted to preserve overall α at end of extended study
  • Consider 4-treatment study: A,B,C,D
    • frequentist assessment of A vs B often discounted because C was compared to D
  • None of these multiplicity adjustments is scientifically satisfactory (Blume (2008))
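The following minimal Monte Carlo sketch (an assumed setup, not from the text) shows where the frequentist pressure to adjust comes from: under a true null effect, testing at 50% information and again at the end, each time with an unadjusted two-sided 0.05 threshold, inflates the overall probability of asserting an effect to roughly 0.08. Restoring 0.05 overall is what forces sponsor 1 to use a stricter final-look cutoff (about 0.047) even when the interim look turns out to be inconsequential.

```python
# Minimal sketch: overall type I assertion probability when a null effect is
# tested at an interim look and at the final look, both at nominal 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
nsim = 200_000

z_half  = rng.standard_normal(nsim)            # z-statistic at 50% information
z_indep = rng.standard_normal(nsim)            # independent increment from second half
z_full  = (z_half + z_indep) / np.sqrt(2)      # z-statistic at the final look

crit = stats.norm.ppf(1 - 0.05 / 2)            # 1.96, two-sided 0.05 threshold
reject_any = (np.abs(z_half) > crit) | (np.abs(z_full) > crit)
print(f"overall type I assertion probability: {reject_any.mean():.3f}")  # ~0.08, not 0.05
```

Tightening the final boundary to keep this at 0.05 is exactly what creates the paradox: identical final data, p = 0.049, and two different regulatory conclusions depending on whether an inconsequential interim look was planned.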

Cook & Farewell (1996), as mentioned above, make a sound argument against arbitrary multiplicity adjustments and their resulting penalization for trying to learn more than one thing from a clinical trial. One can be scientifically accurate and honest by pre-specifying a priority order for the various tests, and presenting p-values in that order regardless of their magnitude.

4.3 Bayesian

  • Posterior probabilities (PPs) well calibrated no matter the type or number of multiplicities present
  • Skepticism focused on effect of interest, not other effects tested
  • Current posterior density accurately reflects study evidence now
  • Obeys likelihood principle
  • The data, and not the context in which the data arose, are what matter for inference
  • Example: inference about prob θ of an event
    • Bayesian inference identical whether one sampled n=20 subjects and found 5 events or enrolled subjects until 5 events occurred, which happened to require n=20 subjects (see the sketch after this list)
    • First design: binomial; second: negative binomial
    • Frequentist: 2 different confidence limits
    • Bayesian: likelihood of the data same: \(\theta^5 (1-\theta)^{15}\)
  • Frequentist significance testing deals with “what would have occurred following results that were not observed at analyses that were never performed” (Emerson (1995))
    • P(test statistic more extreme than the observed value) depends on samples that might have arisen
    • Bayes uses only the sample that has arisen
    • Frequentists limit the sample space (e.g. data look frequency) to limit α: requires more planning and results in less flexibility
  • In comparing treatment A vs B in an A,B,C,D study, Bayes would discount A vs B only because of prior information about how A might compare to B
  • Sequential trial: current PP self-contained, well calibrated, # looks irrelevant
  • Consider example
    • Probabilistic pattern recognition system to identify enemy targets in combat
    • Initial assessment of P(enemy)=0.3 when target is distant
    • Target comes closer, P(enemy)=0.8
    • Closer, air clears, P(enemy)=0.98
    • Irrelevant that P(enemy) was lower a moment ago
    • P(enemy) may decrease while shell is in the air, but P at time of firing was valid
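The binomial vs. negative binomial point above can be sketched in a few lines, assuming for illustration a flat Beta(1, 1) prior: both designs contribute the same likelihood kernel \(\theta^5 (1-\theta)^{15}\), so the Bayesian posterior for θ is identical, whereas frequentist exact confidence limits depend on the design.

```python
# Minimal sketch: the posterior for the event probability depends only on the
# likelihood kernel theta^5 * (1 - theta)^15, not on the stopping rule.
from scipy import stats

events, nonevents = 5, 15
a0, b0 = 1, 1                      # flat Beta(1, 1) prior (an assumption for illustration)

# A binomial design (n = 20 fixed) and a negative binomial design (stop at 5
# events) contribute the same likelihood kernel, so both give the same posterior:
posterior = stats.beta(a0 + events, b0 + nonevents)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))

# Frequentist exact intervals differ between the two designs; e.g., the
# Clopper-Pearson interval for the fixed-n binomial design:
print("binomial exact CI:", stats.binomtest(events, events + nonevents).proportion_ci())
```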

Consider the Bayesian approach to cherry picking. If there is no advance planning, one may be tempted to use default flat priors for each treatment comparison.

  • flat prior \(\rightarrow\) large treatment effects are a priori likely
  • \(\rightarrow\) if a large effect is observed it may be believed
  • computing posterior probabilities of many assertions, with flat priors, leads to a kind of multiplicity problem

Instead, if priors with varying degrees of skepticism are carefully pre-specified, these priors allow safe assessments of multiple endpoints/subgroups/looks. Consider a blood pressure reduction trial with these treatments: placebo, exercise program, a drug, and mental telepathy. If the biggest difference is seen for mental telepathy and that treatment is described as the winner in a press release, we would clearly be engaged in cherry picking and the result is extremely unlikely to replicate. A severe frequentist multiplicity adjustment may be warranted. The Bayesian would solve the problem by having engaged a number of experts before the study. Undoubtedly all of them would believe in the laws of physics and would put forward highly skeptical priors for the telepathy–placebo comparison. This prior would “pull down” the observed treatment effect to tiny values that are much more likely to replicate in other studies, as sketched below.
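A minimal sketch of that “pull down”, with hypothetical numbers: a conjugate normal-normal calculation in which the observed telepathy-vs-placebo blood pressure reduction is taken at face value under a flat prior but shrunk heavily toward zero under the experts’ skeptical prior.

```python
# Minimal sketch (hypothetical numbers): a skeptical prior pulls an observed
# telepathy-vs-placebo effect toward zero, while a flat prior would take the
# observed effect at face value.
import numpy as np

obs_effect, obs_se = 4.0, 2.0        # observed mean SBP reduction (mmHg) and its SE (assumed)
prior_mean, prior_sd = 0.0, 0.5      # experts: telepathy effect almost surely tiny (assumed)

# Conjugate normal-normal posterior: precision-weighted average of prior and data
w_prior = 1 / prior_sd**2
w_data  = 1 / obs_se**2
post_mean = (w_prior * prior_mean + w_data * obs_effect) / (w_prior + w_data)
post_sd   = np.sqrt(1 / (w_prior + w_data))

print(f"flat-prior (face value) estimate: {obs_effect:.2f} mmHg")
print(f"skeptical-prior posterior mean:   {post_mean:.2f} mmHg (SD {post_sd:.2f})")
```

With these assumed numbers the observed 4 mmHg effect is pulled down to roughly 0.2 mmHg, a value far more likely to replicate.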

With Bayes there are no complicated order-dependent closed testing procedures. It is still a good idea to priority-order the various assertions so that when results are reported the context is clear. Other than that, logical rules of evidence as discussed in the last chapter dictate that the veracity of an assertion be independent of the order in which the assertion is examined and depend instead on the data and prior knowledge. Post-study evidence for efficacy of mental telepathy should be based on what is known, and the new data, regardless of when telepathy is analyzed. In this sense one does not need primary, secondary, co-primary, co-secondary, … endpoints but rather needs to be thoughtful about priors and to pre-specify the reporting order.

In the frequentist world, multiplicity comes from the chances you give data to be extreme, not the chances you give true effects to exist.

See the article “Bayesian inference completely solves the multiple comparisons problem” by Andrew Gelman.

4.4 A More Logical Way to Reduce α

A good deal of the time in frequentist analyses the assertion of a treatment effect when the effect is truly zero is triggered by a small observed treatment effect. Forgetting for the moment that α is irrelevant to forward-in-time probabilities of the truth of assertions, one can reduce α by the following procedure. Instead of seeking evidence for a nonzero treatment effect, seek evidence for a non-trivial effect. For example, compute the posterior probability that an odds ratio is below 0.925. Demanding strong evidence (e.g., posterior probability > 0.95) for a more-than-trivial treatment effect will strongly reduce α, which is computed assuming a zero treatment effect.
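A minimal sketch of this idea, assuming a normal approximation to the posterior of the log odds ratio under a nearly flat prior and a hypothetical standard error: requiring posterior probability > 0.95 that the odds ratio is below 0.925, rather than merely below 1, is triggered far less often when the true effect is exactly zero.

```python
# Minimal sketch: how often "strong evidence for a non-trivial effect" is
# asserted when the true odds ratio is exactly 1, compared with the usual
# "any effect" criterion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nsim, se = 100_000, 0.1                 # assumed standard error of the log OR estimate
threshold = np.log(0.925)               # more-than-trivial benefit

# Under a true null (log OR = 0), the estimate is approximately N(0, se^2)
est = rng.normal(0.0, se, nsim)

# Approximate posterior of the log OR is N(est, se^2) under a nearly flat prior
p_nontrivial = stats.norm.cdf((threshold - est) / se)   # P(OR < 0.925 | data)
p_any        = stats.norm.cdf((0.0 - est) / se)         # P(OR < 1     | data)

print("alpha for 'any effect'        :", np.mean(p_any > 0.95))        # about 0.05 (one-sided)
print("alpha for 'non-trivial effect':", np.mean(p_nontrivial > 0.95)) # much smaller
```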

This approach also leads to earlier declaration of futility. Assessment of futility based on requiring a tiny probability of any effect often results in prolonging a study only to find a non-clinically important effect at study end.