General COVID-19 Therapeutics Trial Design
In the COVID-19 therapeutics arena, commonly used outcomes such as time to recovery have deficiencies, and these deficiencies are easily missed. Outcomes such as time to improvement of one or two categories create response variables that mean different things to different patients, because the severity/impact of a given outcome differs for patients starting at different baseline levels. The root causes are floor and ceiling effects, and forgetting that for most ordinal variables the impact of going from, for example, level 2 to level 3 is not the same as going from level 3 to level 4. Ordinal variables should be analyzed on their own terms, adjusted for baseline level, without the use of “change”. Use of current status also allows comparability when a clinical event override is needed. For example, one knows where to put death in comparison to other current patient status levels, but it can be difficult to know where to place death when change in levels is used (it is even more difficult when considering non-death events such as stroke, MI, or ER admission). Ordinal variables used in their “raw” form can easily place a series of clinical events at the top of the scale.
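To make the change-score problem concrete, here is a minimal sketch using a hypothetical 0–7 current-status scale; the scale labels, patient values, and the placement of clinical events at the top are illustrative assumptions, not taken from any specific trial:

```python
# Hypothetical 0-7 ordinal daily status scale (higher = worse), with
# clinical-event overrides occupying the top of the raw scale:
SCALE = {
    0: "back to normal activities",
    1: "symptomatic, at home",
    2: "hospitalized, no oxygen",
    3: "hospitalized, low-flow oxygen",
    4: "high-flow oxygen / non-invasive ventilation",
    5: "invasive mechanical ventilation",
    6: "stroke / MI / other major clinical event",  # event override
    7: "death",                                     # worst, absorbing state
}

# Two patients each "improve by two categories", but the clinical
# meaning of that identical change score is very different:
patient_a = (5, 3)   # off the ventilator, now on low-flow oxygen
patient_b = (3, 1)   # off oxygen and discharged home
for name, (baseline, day14) in [("A", patient_a), ("B", patient_b)]:
    print(f"Patient {name}: change = {day14 - baseline}, "
          f"day-14 status: {SCALE[day14]!r}")
# Analyzing the day-14 status itself, with baseline status as a covariate,
# keeps these patients distinguishable and gives death a defined place.
```

Both change scores equal −2, yet the two day-14 statuses are clinically very different; analyzing current status adjusted for baseline keeps that information.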
Outcomes such as days on oxygen or ventilation lose statistical power when compared to an analysis that uses all the daily patient status indicators, and they have a fairly serious problem in the handling of interruptions (missing data in a middle visit) or death. Going back to the raw data from which these summary measures are calculated is always a good idea.
These issues have a huge impact on sample size and interpretation. While achieving optimum power, longitudinal analysis not only respects the longitudinal nature of the raw data and allows for missing measurements, but also allows you to derive any clinical readout you want from the final analysis, including
- probability of being well by a certain follow-up time
- expected time until an event
- incidence of mortality
- incidence of mortality or lung dysfunction worse than x
- mean time unwell / mean time requiring organ support
The last readout has been found to be quite appealing to clinicians, and it explicitly takes death into account while elegantly allowing for days/weeks where component outcomes could not be assessed.
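As a sketch of how such a readout can be derived from the raw daily records, the following assumes a hypothetical 0–7 scale (0 = well, 7 = death) and 28 days of follow-up; all names and numbers are illustrative, and a real longitudinal model would handle missing days formally rather than skipping them:

```python
# Sketch: "days unwell" computed from raw daily status records.
# Toy scale: 0 = well, 1-6 = increasing severity, 7 = death.
FOLLOWUP_DAYS = 28
DEATH = 7

def days_unwell(daily_status):
    """Count follow-up days with status > 0, carrying death forward so
    that dying counts as unwell for every remaining day of follow-up.
    Missing days (None) are simply skipped in this toy version."""
    total, died = 0, False
    for day in range(FOLLOWUP_DAYS):
        s = daily_status[day] if day < len(daily_status) else None
        if s == DEATH:
            died = True
        if died:
            total += 1
        elif s is not None and s > 0:
            total += 1
    return total

# Recovered after 5 days, then well for the rest of follow-up:
print(days_unwell([3, 3, 2, 2, 1] + [0] * 23))   # -> 5
# Died on day 10: that day and all remaining days count as unwell:
print(days_unwell([3] * 10 + [DEATH]))           # -> 28
```

Comparing treatment arms on the mean of this quantity gives the "mean time unwell" readout; in practice it would be estimated from the fitted longitudinal model rather than computed from raw counts.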
Longitudinal outcome analysis can be conducted using either Bayesian or frequentist methods. An overall Bayesian design would have major advantages, including
- the ability to stop slightly earlier with evidence for efficacy
- the ability to stop much earlier for futility
- no need for scheduling data looks; Bayes allows on-demand evidence assessment
- ability to adapt, drop arms, add new arms
- provides direct evidence measures, not the probability of getting surprising data IF the treatment doesn’t work
- directly incorporates clinical significance, e.g., you can compute the probability that the treatment effect is more than a clinically trivial amount
- provides much simpler interpretation of multiple outcomes as the Bayesian evidence for a treatment effect on one outcome is not discounted by the fact that you looked at other outcomes
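The clinical-significance point can be sketched directly from posterior draws. Assuming, hypothetically, that the treatment effect is summarized as a log odds ratio and that posterior draws are available from MCMC (simulated normal draws stand in for them below), the probability of a more-than-trivial benefit is just the fraction of draws beyond the trivial threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MCMC posterior draws of a log odds ratio (treatment vs.
# control, OR < 1 = benefit); the location and scale are illustrative.
log_or_draws = rng.normal(loc=-0.35, scale=0.15, size=10_000)

trivial = np.log(0.95)   # assume OR between 0.95 and 1 is clinically trivial
p_any_benefit  = np.mean(log_or_draws < 0)         # P(OR < 1)
p_real_benefit = np.mean(log_or_draws < trivial)   # P(OR < 0.95)
print(f"P(any benefit)                 = {p_any_benefit:.3f}")
print(f"P(more than a trivial benefit) = {p_real_benefit:.3f}")
```

No special machinery is needed: any question about the effect, including "is it bigger than a clinically trivial amount?", is answered by counting posterior draws.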
This document covers general design and analysis issues and recommendations that pertain to COVID-19 therapeutic randomized clinical trials. Experimental designs are particular to therapies and target patients so are covered only briefly, other than sequential designs which are quite generally applicable. Some of the issues discussed here are described in more detail at hbiostat.org/proj/covid19 where several sequential clinical trial simulation studies may also be found.
For reasons detailed below, we propose that the default design for COVID-19 therapeutic studies be Bayesian sequential designs using high-information/high-power ordinal outcomes as overviewed in this video.
The material on selection and construction of outcome variables applies equally to Bayesian and traditional frequentist statistical analysis. The following Design section includes some general material and lists several advantages of continuous learning Bayesian sequential designs. Before discussing design choices, we summarize the most common statistical pitfalls in RCTs and contrast frequentist and Bayesian methods. This is important because we recommend that a Bayesian approach be used in the current fast-moving environment to maintain flexibility and accelerate learning.
1 Most Common Pitfalls in Traditional RCT Designs
The most common outcome of a randomized clinical trial is a p-value greater than some arbitrary cutoff. In this case, researchers who are aware that absence of evidence is not evidence of absence will conclude that the study is inconclusive (especially if the confidence interval for the treatment difference is wide). More commonly, the researchers or a journal editor will conclude that the treatment is ineffective. This should raise at least five questions.
- What exactly is the evidence that the active treatment results in similar patient outcomes as control?
- What was the effect size assumed in the power calculation, and was it greater than a clinically relevant effect? In other words, was the sample size optimistically small?
- Did the outcome have the most statistical power among all clinically relevant outcomes?
- What would have happened had the study been extended? Would it have gone from an equivocal result to a definitive result?
- If the conclusion is “lack of efficacy”, could we have reached that conclusion with a smaller number of patients randomized by using a sequential design?
Powering a study to detect a miracle when all that happened is a clinically important effect is unfortunately all too common, as is the use of inefficient outcome variables. Fixed sample size designs, though easy to understand and budget, are a key cause of wasted resources and all too frequently result in uninformative studies. The classical sample size calculation assumes a model, makes assumptions about patient-to-patient variability or event incidence (assumptions shared by frequentist and Bayesian approaches), and then assumes an effect size “not to miss”. The assumed effect size is usually overoptimistic, to make the budget palatable. A continuously sequential Bayesian design allows one to run the study until
- there is strong evidence for efficacy
- there is moderately strong evidence for harm
- there is moderately strong evidence for similarity
- the probability of futility is high, e.g., the Bayesian predictive probability of success is low given the current data even if the study were to progress to the maximum affordable sample size
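The futility check can be sketched with a toy Beta-Binomial model for a binary “bad outcome”: simulate the remaining patients from the current posterior, re-analyze each completed hypothetical trial, and count how often the final posterior would clear the efficacy threshold. All numbers below (priors, sample sizes, cutoffs) are illustrative assumptions:

```python
import random
random.seed(1)

N_MAX = 200          # assumed maximum affordable sample size per arm
SUCCESS_CUT = 0.95   # declare efficacy if P(p_treat < p_control) > 0.95

def post_prob_treat_better(e_t, n_t, e_c, n_c, draws=2000):
    """Monte Carlo P(p_treat < p_control) under Beta(1 + events,
    1 + non-events) posteriors (flat Beta(1, 1) priors)."""
    wins = 0
    for _ in range(draws):
        pt = random.betavariate(1 + e_t, 1 + n_t - e_t)
        pc = random.betavariate(1 + e_c, 1 + n_c - e_c)
        wins += pt < pc
    return wins / draws

def predictive_prob_success(e_t, n_t, e_c, n_c, sims=500):
    """Simulate the remaining patients from the current posterior and
    count how often the completed trial would declare efficacy."""
    succ = 0
    for _ in range(sims):
        pt = random.betavariate(1 + e_t, 1 + n_t - e_t)
        pc = random.betavariate(1 + e_c, 1 + n_c - e_c)
        ft = e_t + sum(random.random() < pt for _ in range(N_MAX - n_t))
        fc = e_c + sum(random.random() < pc for _ in range(N_MAX - n_c))
        succ += post_prob_treat_better(ft, N_MAX, fc, N_MAX, draws=400) > SUCCESS_CUT
    return succ / sims

# Interim data showing essentially no treatment effect: the predictive
# probability of eventual success is low, signaling futility.
print(predictive_prob_success(e_t=30, n_t=100, e_c=30, n_c=100))
```

When this predictive probability is low even at the maximum affordable sample size, the trial can stop for futility rather than spending the remaining budget.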
The idea of experimenting until one has an answer, though routinely practiced by physicists, is underutilized in medicine.
A second major statistical pitfall is inflexibility: a traditional design cannot be modified in reaction to changing disease or medical practice patterns after randomization begins, because one would then not know how to compute a p-value, which requires repeated identical sampling.
See this for a full Bayesian design that solves many of the problems in clinical trials.