9  Bayesian Clinical Trial Design

flowchart LR
D[Study Design] --> G[Desired Conclusions] & UD
G --> Des[Efficacy<br>Non-trivial efficacy<br>Inefficacy<br>Harm<br>Non-inferiority<br>Similarity<br>Estimation]
UD[Avoid] --> NO[No conclusion at<br>maximum N]
D --> OC[Desired Operating<br>Characteristics]
OC --> C & P & ST & Prec
C[Correctness<br>of conclusions]
P[High probability<br>of detecting effect=MCID at<br>the average sample size]
ST[Expected stopping time]
Prec[Adequate precision of estimated effect<br>if evidence for effectiveness]
D --> Ch[Challenges] --> chl[Understanding that correctness of the conclusion is the only relevant OC for post-data decision making<br>Not allowing gaming of effect size used to choose N<br>Willingness to stop very early for futility<br>Whether to stop early for similarity<br>Formalizing minimum precision criterion<br>Specificity and objectivity about triviality of treatment effects]


flowchart TD
N["N<br>Frequentist: 234 for 0.9 power<br>Bayesian 0.9 for Δ > 0: 196<br>Bayesian 0.9 for Δ > γ: 282<br>Bayesian sequential N avg: 113<br>Bayesian Np=60 min N for precision<br>Max N: 750"]
BS[Bayesian Simulation]
BS --> SM[Different effect Δ for each trial]
SM --> G[Bayesian goal:<br>Uncover Δ generating the data]
G --> P[Universe of Δ<br>Chosen to be disadvantageous] --> NT[10,000 Trials] 
NT --> S[Sequential assessment<br>unlimited looks]
S --> SI[Stop at any time<br>for inefficacy]
S --> SS[Stop at any time<br> for similarity]
S --> NON[No conclusion]
S --> SE["Stop any time N≥Np for<br>non-trivial efficacy"]
SI --> SIN[5184 Stopped early<br>Avg N: 62] --> SIC["5020 correct decision for Δ < γ (0.97)"]
SS --> SSN[634 Stopped early<br>Avg N: 423] --> SSC["607 correct decision (0.96)"]
SE --> SEN[4010 Stopped early<br>Avg N: 102] --> SEC["3643 correct decision for Δ > γ (0.91)<br>3913 correct decision for Δ > 0 (0.98)"]
NON --> NONN[172 went to N=750] --> MD[Median Δ = γ]

Source: hbiostat.org/bayes/bet/design

Asking one to compute type I assertion probability α for a Bayesian design is like asking a poker player winning > $10M/year to justify his ranking by how often he places bets in games he didn’t win.

Do you want the probability of a positive result when there is nothing? Or the probability that a positive result turns out to be nothing?

Details in Section 9.9

9.1 Goals of a Therapeutic Trial

There are many goals of comparing treatments, including quantifying evidence for

  • any efficacy
  • non-trivial efficacy
  • inefficacy
  • harm
  • similarity
  • non-inferiority

and sometimes the goal is just to estimate the treatment effect, whatever it is. Bayesian methods provide probabilities that each of the above conclusions is correct, given the data, the data model, and the prior distribution for the treatment effect. For estimation, Bayes provides a complete probability distribution for the unknown effect, and corresponding interval estimates (highest posterior density intervals, credible intervals). With the enormous number of traditional frequentist clinical trials that reach their planned fixed sample size but have equivocal results, one particular advantage of Bayesian sequential designs is all-important: minimizing the chance of getting to the maximum sample size with no conclusion about efficacy.

Also, the Bayesian approach prevents the all-too-common absence of evidence is not evidence of absence error arising from attempting to draw a conclusion from a large P-value.

A major advantage of Bayes is that it provides evidence in favor of assertions, not just evidence against a hypothesis. That explains why Bayesian designs and analyses are the same (other than the assertion whose probability is being computed) for all study goals. Only the required sample size may differ by the trial’s intent.

Another major advantage of Bayes is that since it involves probabilities about unknowns given the current data, it does not have multiplicity problems with sequential evidence assessment (Chapter 5). Evidence from current data are not down-weighted because the data were examined previously or because they will be examined in the future. Bayesian trial design need not have a sample size, and if it does, there is nothing wrong with exceeding that sample size when results are equivocal. This implies that Bayesian designs are less likely to result in a trial having an equivocal result. And the more often one analyzes the data the lower the expected sample size. One can stop as soon as there is a pre-specified amount of evidence for one of the conclusions listed above. When sample size N is discussed below, it can refer to a planned maximum sample size, an average sample size, the 75\(^\text{th}\) percentile of a distribution of sample sizes, etc.

Evidence from current data is the point of emphasis. Correctness of a decision is the probability that a decision made using the current data is correct. In frequentism, multiplicity comes from multiple chances to observe extreme data, which grows as the number of data looks increases. Bayesian posterior probabilities are about efficacy, not about data.

With a Bayesian sequential design it is easy to add a requirement that the treatment effect be estimated with a specified precision. Often this precision (e.g., half-width of a credible interval) can be estimated before the trial starts. Then the precision-required sample size can define the time of the first data look.

9.2 Challenges

  • Change conceptions about operating characteristics
    • Primary OC must be the correctness of whatever decision is made
  • Be willing to stop a study very early when there is a very low probability of non-trivial efficacy
  • Whether to stop a study early for a conclusion of similarity of efficacy of the two treatments
  • Abandon need for estimating treatment effect when there is ample evidence for inefficacy
  • Be objective and formal about interpretation of effects by specifying a threshold for non-trivial effect
  • Removing cost from consideration of effect sizes to detect in favor of finding an outcome variable that will have rich enough information to achieve power with clinically-chosen effect size
  • Regulators and sponsors must recognize that the operating characteristic that needs to be optimized is the correctness of whatever decision is made about treatment effectiveness.
  • Sponsors must be willing to stop studies very early when there is strong evidence of inefficacy. This will free up patients with rare diseases for other trials, and free up resources likewise.
  • To be able to stop studies very early so as to not waste resources when the probability is very high that there is no non-trivial efficacy, sponsors must jettison the need to be able to estimate how ineffective a treatment is.
  • At present, consideration of whether a nonzero treatment effect is trivial is highly subjective and informal. To make optimal and defensible decisions, this must be formalized by specifying a threshold for non-triviality of effect. Importantly, this threshold is smaller than the MCID.
  • The current practice of choosing a more than clinically important effect size to use in power calculations has been called “sample size samba” and has resulted in a scandalous level of overly optimistically designed trials, commonly having an equivocal result. Many equivocal trials observed an effect that was merely clinically significant. Powering the study to only detect a “super clinically significant” effect results in low actual power and wasted resources. Objectively choosing effect sizes without cost considerations, then abandoning studies that cannot find an acceptable outcome variable that has sufficient resolution to achieve adequate power with a “real” effect size will prevent a large amount of waste and make clinical trials more ethical while freeing up patients with rare diseases to participate in other trials.

9.3 Bayesian Operating Characteristics

The primary Bayesian operating characteristic is how often a conclusion is correct. The primary frequentist operating characteristic is how often one reaches a conclusion.

Bayesian inference allows one to make both positive (“the drug works”) and negative (“the drug doesn’t work”) conclusions. Evaluation of Bayesian operating characteristics can focus on either the probability that whatever decision is made is correct, or the correctness of a single decision such as a drug works. Frequentism based on null hypothesis testing and P-values only brings evidence against a hypothesis, so the only possible actions are (1) to reject \(H_0\) (the one actual conclusion a frequentist approach can make), (2) get more data, or (3) stop with no conclusion except a failure to reject \(H_0\). The operating characteristic given greatest emphasis by regulators has been type I assertion probability \(\alpha\), the probability of rejecting \(H_0\) when it is magically known to be true.

When designs are being formulated and when N is being selected, researchers need to know that the typical way the trial is likely to unfold is acceptable. The performance of a proposed Bayesian design must be judged by its Bayesian operating characteristics (OCs). Bayes is about prospective analysis and interpretation, i.e., trying to uncover the unknown based on what is known. It minimizes the use of unobservables in acting in forward information-flow predictive mode. Frequentist OCs deal with what may happen under unknowable assumptions about the treatment effect generating the data. Bayesian OCs are the only OCs important when running Bayesian analyses, interpreting the results, and knowing how much to trust them. A key point is that accuracy of conclusions pertains to what has been learned from data, not what data may arise under certain unknowable assumptions about effects. Accuracy is a post-data consideration.

Bayesian OCs are of four types. First there is assurance, which is the ability of the design to detect a real treatment effect were it to exist. This is Bayesian power (the chance of achieving a high posterior probability of efficacy). Second is precision of post-data knowledge about the treatment effect, e.g., half of the width of an uncertainty interval. Third is the expecting stopping time, which is related to the efficiency of the design and the expected cost of the study. Fourth is the all-important judgment of the correctness of the decision at the moment the decision is made. Bayesian approaches optimize the chance of conclusions being correct if the following three conditions hold:

The first two conditions are in common with traditional frequentist approaches.
  • the variables being analyzed are measured unbiasedly
  • the data model is consistent with the experimental design and is reasonably well-fitting
  • the prior used in the analysis is not in conflict with the prior used by the judge of the accuracy of the study results

Quantifying correctness of conclusions for a Bayesian design is equivalent to assessing predictive accuracy, which involves examining what’s real as a function of predicted values. Correctness of conclusions (predictions) is easy to check in a simulation study, as explained in detail in Section 9.9. The process involves looking back to the treatment effect that generated the current trial data after the conclusion about the treatment effect has been made. After a large number of trials are simulated, the subset of trials reaching a certain conclusion is chosen, and for that subset the proportion of trials for which the true treatment effect is consistent with the conclusion is computed.

When Bayes is used to the fullest, decisions take consequences into account through a utility, loss, or cost function. The optimum Bayes decision that maximizes expected utility is a function of the posterior distribution of the treatment effect and the utility function (decision consequences). But different stakeholders have different utility functions, so we are usually content to deal only with the posterior distribution, a major input into the optimum decision.

In actual studies most of the attention about this most important OC is focused on the choice of the prior distribution. When the prior is acceptable, the Bayesian procedure will result in optimum decisions in the eye of the beholder of the prior, by definition. Sensitivity of decisions to the choice of a prior would be a useful assessment at the design stage.

In Chapter 5 Bayesian sequential analysis is studied in more detail, though the simulations there are simple from the standpoint of having only one reason for stopping. In that situation, one stops, for example, when a posterior probability PP exceeds 0.95. When the simulation is done using the same prior as the prior used in the analysis, if the PP is 0.96 upon stopping, the true probability of a treatment effect will be exactly 0.96. Trusting the Bayesian results at the decision point requires the PP being well calibrated. This happens when there is no major conflict between the design prior and the analysis prior.

9.4 Example: Two-Treatment Parallel-Group Efficacy Design


  • Maximize the probability of making the right decision, with 4 decisions possible (efficacy, similarity, inefficacy, not possible to draw any conclusion with available resources)
  • Minimize the probability of an inconclusive result at study end
  • Minimize the expected sample size
  • Stop as early as possible if there is a very high probability that the treatment’s effect is trivial or worse, even if the magnitude of the effect cannot be estimated with any precision
  • If stopping early for non-trivial efficacy, make sure that the sample size provides at least crude precision of the effect estimate
    • Do this by not making the first efficacy look until the lowest sample size for which at least crude precision of the treatment effect is attained
  • Avoid gaming the effect size used in a power calculation
  • Recognize that traditional sample sizes are arbitrary and require too many assumptions
  • Recognize that what appears to be a study extension to one person will appear to be an interim analysis to another who doesn’t need to do a sample size calculation
  • Build clinical significance into the quantification of evidence for efficacy


  • \(\Delta\): true treatment effect being estimated
  • \(\delta\): minimum clinically important \(\Delta\)
  • \(\gamma\): minimum detectable \(\Delta\) (threshold for non-trivial treatment effect, e.g. \(\frac{1}{3}\delta\))
  • SI: similarity interval for \(\Delta\), e.g., \([-\frac{1}{2}\delta, \frac{1}{2}\delta]\)
  • \(N\): average sample size, used for initial resource planning. This is an estimate of the ultimate sample size based on assuming \(\Delta = \delta\). \(N\) is computed to achieve a Bayesian power of 0.9 for efficacy, i.e., achieving a 0.9 probability at a fixed sample size that the probability of any efficacy exceeds 0.95 while the probability of non-trivial efficacy exceeds 0.85, i.e., \(\Pr(\Delta > 0) > 0.95\) and \(\Pr(\Delta > \gamma) > 0.85\).
  • More than one \(N\) can be computed, e.g., \(N\) such that if \(\Delta = \gamma\) there is ever a high probability of stopping for inefficacy
  • \(N_p\): minimum sample size for first efficacy assessment, based on required minimum precision of estimate of \(\Delta\)
  • Probability cutoffs are examples
  • This design does not require a separate futility analysis, as inefficacy subsumes futility
  • Below, “higher resolution Y” can come from breaking ties in the outcome variable (making it more continuous), adding longitudinal assessments, or extending follow-up

flowchart TD
A[Choose response variable Y with<br>maximum resolution that is<br>relevant to the patient] --> B[Elicit effect sizes δ, γ for Y from experts and patients<br>who are unaware of budget limits] --> D[Compute average sample size N<br>and minimum efficacy sample size Np] --> E[Is N feasible?]
E --> |No| G[Can a higher resolution Y be developed?] --> |No| I[Abandon Study]
G --> |Yes| B
E --> |Yes| L[Begin study<br>Do frequent sequential analysis] --> N["Pr(Δ > γ) < 0.025?"]
N --> |Yes| P[Stop for inefficacy<br>Low chance of non-trivial effect]
N --> |No| Np["Have ≥ Np participants finished<br>their first follow-up?"]
Np --> |No| Y 
Np --> |Yes| R["Pr(Δ > 0) > 0.95 and Pr(Δ > γ) > 0.85"] --> |Yes| T[Stop for non-trivial efficacy]
R --> |No| R2["Pr(Δ in SI) > 0.95"] --> |Yes| R3[Stop for similarity]
R2 --> |No| V2[Resources exhausted?] --> |No| Y[Accrue more participants] --> N
V2 --> |Yes| Z[Stop with inability<br>to reach any<br>conclusions] 

Example operating characteristics for this design are simulated in Section 9.9.

  • Bayesian inference opens up a wide variety of adaptive trials
    • stat methods need no adaptations
    • only sample size and Bayesian power calculations are complicated
  • But here consider non-adaptive trials with possible sequential monitoring

9.5 Sample Size Estimation

  • In a sense more honest than frequentist b/c uncertainty admitted
    • SDs
    • effect size
  • Whitehead et al. (2015): sample size calculation for control vs. set of active tx
    • goal: high PP that ≥ 1 tx better than control or
    • high PP that none is better
  • Other good references: Kunzmann et al. (2021), Spiegelhalter et al. (1993), Grouin et al. (2007), Spiegelhalter & Freedman (1986), Adcock (1997), Joseph & Bélisle (1997), Pezeshk & Gittins (2002), Simon & Freedman (1997), Wang et al. (2005), Braun (n.d.), this
  • Spiegelhalter et al. (1993) has clear rationale for unconditional power integrating over prior
    • “realistic assessment of the predictive probability of obtaining a ‘significant’ result”
    • discusses damage caused by using fixed optimistic treatment effect
  • Uebersax has an excellent article on Bayesian unconditional power analysis which also discusses the hazards of choosing a single effect size for power estimation

9.6 Bayesian Power and Sample Size Examples

  • Example notion of power: probability that PP will exceed a certain value such as 0.95 (Bayesian assurance)
  • Compute as a function of the unknown effect, see which true effects are detectable for a given sample size
  • If the minimum clinically important difference (MCID) is known, compute the success probability at this MCID
  • Better: take uncertainty into account by sampling the MCID from an uncertainty (prior) distribution for it, using a different value of MCID for each simulated trial. The Bayesian power (probability of success) from combining all the simulations averages over the MCID uncertainties. This is demonstrated in a time-to-event power calculation below.

9.6.1 One-Sample Mean Example

  • Consider earlier one-sample mean problem
  • Data are normal with variance 1.0 and the prior is normal with mean zero and variance \(\frac{1}{\tau}\)
  • Take \(\mu > 0\) to indicate efficacy
  • PP \(= P(\mu > 0 | \textrm{data}) = \Phi(\frac{n \overline{Y}}{\sqrt{n + \tau}})\)
  • \(\overline{Y}\) is normally distributed with mean \(\mu\) and variance \(\frac{1}{n}\)
  • Long-run average of \(\overline{Y}\) is \(\mu \rightarrow\) long-run median of \(\overline{Y}\) is also \(\mu \rightarrow\) median PP \(= \Phi(\frac{n \mu}{\sqrt{n + \tau}})\)
  • Let \(z = \Phi^{-1}(0.95)\)
  • Chance that PP exceeds 0.95 is \(P(\Phi(\frac{n \overline{Y}}{\sqrt{n + \tau}}) > 0.95) = P(\frac{n \overline{Y}}{\sqrt{n + \tau}} > z) = \Phi(\frac{n \mu - z \sqrt{n + \tau}}{\sqrt{n}})\)
  • For varying \(n\) and \(\tau\) the median PP and the Bayesian power are shown in the figure
d <- expand.grid(mu=seq(-2, 2, length=200),
                 n=c(1, 2, 5, 10),
                 tau=c(0, 2, 10),
d <- transform(d,
               p = ifelse(what == 1, 
                    pnorm(n * mu / sqrt(n + tau)),
                    pnorm((n * mu - qnorm(0.95) * sqrt(n + tau)) / sqrt(n))),
               n = factor(n),
               tau = factor(tau)
d$Probability <- factor(d$what)
levels(d$n)   <- paste0('n==', levels(d$n))
levels(d$tau) <- paste0('tau==', levels(d$tau))

ggplot(d, aes(x=mu, y=p, col=Probability)) + geom_line() +
  scale_color_discrete(labels=expression(Median~P(mu > 0),
                                         P(P(mu > 0) > 0.95))) +
  facet_grid(tau ~ n, labeller=label_parsed) +
  xlab(expression(mu)) +

Bayesian power and median posterior probability for varying true effect values, sample sizes, and degrees of skepticism

9.6.2 Sample Size for Comparing Time to Event

In this example we compute the sample size needed to achieve adequate Bayesian power to achieve evidence for various assertions. The treatment effect is summarized by a hazard ratio (HR) in a proportional hazards (PH) model. Assertions for which the study needs to have the ability to provide conclusive evidence include

  • any reduction of hazard of an event (HR < 1)
  • HR < 0.98, 0.96, 0.94, …
  • harm or trivial benefit (HR > 0.95)

The trial’s actual analysis might use a flexible parametric PH model implemented in the rstanarm Bayesian survival analysis package. Bayesian operating characteristics are simulated here using the Cox PH model.

Take Bayesian power to be the probability that the posterior probability will exceed 0.975 when the true HR is a constant1. For the first two types of assertions, set the HR generating the data to 0.75 (assumed minimally clinically important difference), and for the third type of assertion set HR to 1.25.

1 Ideally one would consider power over a distribution of HRs as done below.

The actual analysis to be used for the clinical trial should adjust for covariates, but covariates are ignored in our simulations. Bayesian power will increase with the addition of covariates.

If the treatment groups are designated A (standard of care) and B (new treatment), suppose that the following Weibull survival distribution has been found to fit patient’s experiences under control treatment.

\[S(t) = e^{- 0.9357 t^{0.5224}}\] The survival curve for treatment B patients is obtained by multiplying 0.9357 by the B:A hazard ratio (raising \(S(t)\) to the HR power).

Assume further that follow-up lasts 3y, 0.85 of patients will be followed a full 3y, and 0.15 of patients will be followed 1.5y when the final analysis is done2.

2 The R Hmisc package spower function will simulate much more exotic situations.

To compute Bayesian power, use the normal approximation to the Cox partial likelihood function, i.e., assume that log(HR) is normal and use standard maximum likelihood methods to estimate the mean and standard deviation of the log(HR). The Cox log-likelihood is known to be very quadratic in the log HR, so the normal approximation will be very accurate. For a prior distribution assume a very small probability of 0.025 that HR < 0.25 and a probability of 0.025 that it exceeds 4. The standard deviation for log(HR) that accomplishes this is \(\frac{\log(4)}{1.96} = 0.7073 = \sigma\).

With a normal approximation for the data likelihood and a normal prior distribution for log(HR), the posterior distribution is also normal. Let \(\theta\) be the true log(HR) and \(\hat{\theta}\) be the maximum likelihood estimate of \(\theta\) in a given simulated clinical trial. Let \(s\) be the standard error of \(\hat{\theta}\)3.

3 Note that in a proportional hazards model such as the one we’re using, the variance of \(\hat{\theta}\) is very close to the sum of the reciprocals of the event frequencies in the two groups.

Let \(\tau = \frac{1}{\sigma^{2}} = 1.9989\) and let \(\omega = \frac{1}{s ^ {2}}\). Then the approximate posterior distribution of \(\theta\) given the data and the prior is then normal with mean \(\mu\) and variance \(v\) where \(v = \frac{1}{\tau + \omega}\) and \(\mu = \frac{\omega \hat{\theta}}{\tau + \omega}\).

The posterior probability that the true HR < \(h\) is the probability that \(\theta < \log(h)\), which is \(\Phi(\frac{\log(h) - \mu}{\sqrt{v}})\).


First consider code for generating outcomes for one clinical trial of size \(n\) per group, follow-up time cutoff \(u\), secondary cutoff time \(u2\) with probability \(pu2\), and true hazard ratio \(H\). The result is posterior mean and variance and the P-value from a frequentist likelihood ratio \(\chi^2\) test with the Cox PH model.

sim1 <- function(n, u, H, pu2=0, u2=u, rmedian=NA) {
  # n: number of pts in each group
  # u: censoring time
  # H: hazard ratio
  # pu2: probability of censoring at u2 < u
  # rmedian: ratio of two medians and triggers use of log-normal model

  # Generate failure and censoring times for control patients
  # Parameters of a log-normal distribution that resembles the Weibull
  # distribution are below (these apply if rmedian is given)
  mu    <- -0.2438178
  sigma <- 1.5119236
  ta <- if(is.na(rmedian))
    (- log(runif(n)) / 0.9357) ^ (1.0 / 0.5224)
  else exp(rnorm(n, mu, sigma))
  ct <- pmin(u, ifelse(runif(n) < pu2, u2, u))
  ea <- ifelse(ta <= ct, 1, 0)
  ta <- pmin(ta, ct)   # apply censoring
  # Generate failure and censoring times for intervention patients
  tb <- if(is.na(rmedian))
    (- log(runif(n)) / (H * 0.9357)) ^ (1.0 / 0.5224)
  else exp(rnorm(n, mu + log(rmedian), sigma))
  ct <- pmin(u, ifelse(runif(n) < pu2, u2, u))
  eb <- ifelse(tb <= ct, 1, 0)
  tb <- pmin(tb, ct)
  ti <- c(ta, tb)
  ev <- c(ea, eb)
  tx <- c(rep(0, n), rep(1, n))
  f <- if(is.na(rmedian)) coxph(Surv(ti, ev) ~ tx) else
    survreg(Surv(ti, ev) ~ tx, dist='lognormal')
  chi <- 2 * diff(f$loglik)
  theta.hat <- coef(f)[length(coef(f))]
  s2        <- vcov(f)['tx', 'tx']
  tau       <- 1.9989
  omega     <- 1.0 / s2
  v         <- 1.0 / (tau + omega)
  mu        <- omega * theta.hat / (tau + omega)
  if(! all(c(length(mu), length(v), length(chi))) == 1) stop("E")
  list(mu   = as.vector(mu),
       v    = as.vector(v),
       pval = as.vector(1.0 - pchisq(chi, 1)))

Simulate all statistics for \(n=50, 51, ..., 1000\) and HR=0.75 and 1.25, with 10 replications per \(n\)/HR combination.

HR <- 0.75
ns <- seq(50, 1000, by=1)
R  <- expand.grid(n = ns, HR=c(0.75, 1.25), rep=1 : 10)
setDT(R)   # convert to data.table
g <- function() {
  R[, sim1(n, 3, HR, 0.15, 1.5), by=.(n, HR, rep)]
S <- runifChanged(g, R, sim1, file='wsim.rds')

The number of trials simulated for each of the two HR’s was 9510.

Before simulating Bayesian power, simulate frequentist power for the Cox PH likelihood ratio \(\chi^2\) test with two-sided \(\alpha=0.05\). Effect size to detect is HR=0.75.

R <- S[HR == 0.75, .(reject = 1 * (pval < 0.05)), by=n]
# Compute moving proportion of trials with p < 0.05
# Use overlapping windows on n and also logistic regression with
# a restricted cubic spline in n with 5 knots
m <- movStats(reject ~  n, eps=100, lrm=TRUE,
              data=R, melt=TRUE)
       aes(x=n, y=reject, linetype=Type)) + geom_line() +
  geom_hline(yintercept=0.9, alpha=0.3) +
  scale_x_continuous(breaks=seq(0, 1000, by=100)) +
  scale_y_continuous(breaks=seq(0, 1, by=0.1),
                     minor_breaks=seq(0, 1, by=0.05)) +
  ylab('Pr(reject H0:HR=1)')

Use linear interpolation to solve for the sample size needed to have 0.9 frequentist power.

lookup <- function(n, p, val=0.9)
  round(approx(p, n, xout=val)$y)
m[Type == 'LR', lookup(n, reject)]
[1] 333

For each trial simulated with HR=0.75, compute posterior probabilities that the true HR \(< h\) for various HR cutoffs \(h\). Both moving proportions and logistic regression with a restricted cubic spline in \(n\) are used to smooth the results to estimate the relationship between \(n\) and Bayesian power, i.e., the probability that the posterior probability reaches 0.975 or higher.

R <- S[HR == 0.75]
hs <- seq(0.8, 1.0, by = 0.02)
pp <- function(mu, v) list(h=hs, p=pnorm((log(hs) - mu) / sqrt(v)))
R <- R[, pp(mu, v), by=.(n, rep)]
# Compute moving proportion with posterior prob > 0.975
# with overlapping windows on n, and separately by h
R[, phi := 1 * (p >= 0.975)]
m <- movStats(phi ~  n + h, eps=100, lrm=TRUE, lrm_args=list(maxit=30),
              data=R, melt=TRUE)
       aes(x=n, y=phi, color=h, linetype=Type)) + geom_line() +
  geom_hline(yintercept=0.9, alpha=0.3) +
  scale_x_continuous(breaks=seq(0, 1000, by=100)) +
  scale_y_continuous(breaks=seq(0, 1, by=0.1),
                     minor_breaks=seq(0, 1, by=0.05)) +
  ylab('Pr(Posterior Pr(HR < h) > 0.975)')

Use linear interpolation to compute the \(n\) for which Bayesian power is 0.9.

u <- m[Type == 'LR', .(nmin=lookup(n, phi)), by=h]
         h  nmin
    <char> <num>
 1:    0.8    NA
 2:   0.82    NA
 3:   0.84    NA
 4:   0.86    NA
 5:   0.88    NA
 6:    0.9   866
 7:   0.92   687
 8:   0.94   584
 9:   0.96   489
10:   0.98   394
11:      1   339
Nprime <- u[h == 1, nmin]

Check the Bayesian power to detect a trivial or worse treatment benefit when true HR is 1.25.

HR <- 1.25
hs <- 0.95
R <- S[HR == 1.25, pp(mu, v), by=.(n, rep)]
# One minus because we want Pr(HR > 0.95)
R[, phi := 1 * (1.0 - p > 0.975)]
m <- movStats(phi ~ n, eps=100, lrm=TRUE, data=R, melt=TRUE,
ggplot(m, aes(x=n, y=phi, linetype=Type)) + geom_line() +
  ylab('Pr(Posterior Pr(HR > 0.975))') +
  geom_hline(yintercept=0.9, alpha=0.3) +
  scale_x_continuous(breaks=seq(0, 1000, by=100)) +
  scale_y_continuous(breaks=seq(0, 1, by=0.1),
                     minor_breaks=seq(0, 1, by=0.05))

Find \(n\) where it crosses 0.9.

m[Type == 'LR', lookup(n, phi)]
[1] 328

Accounting for Uncertainty in MCID

Different reviewers and journal readers may have different ideas of a minimal clinically important difference to detect. Instead of setting the MCID to HR=0.75, suppose that all HR in the interval [0.65, 0.85] are equally likely to be the MCID at the time the study findings are made known. Let’s re-simulate 1000 trials to compute Bayesian power at a sample size of 339 per group, varying the MCID over [0.65, 0.85].

R  <- expand.grid(n = Nprime, HR=runif(1000, 0.65, 0.85))
setDT(R)   # convert to data.table
S <-  R[, sim1(n, 3, HR, 0.15, 1.5), by=.(n, HR)]
pp <- function(mu, v) pnorm((log(1.0) - mu) / sqrt(v))
bp <- S[, mean(pp(mu, v) > 0.975)]
[1] 0.848

The Bayesian power drops from 0.9 to 0.848 when the stated amount of uncertainty is taken into account for MCID.

9.6.3 Bayesian Power for 3-Level Ordinal Outcome

Suppose that the above time-to-event outcome is primary, and that there is a secondary outcome of infection within 90 days of randomization. To explicitly account for deaths within the 90-day window we use a 3-level proportional odds ordinal logistic model with 0=alive and infection-free, 1=alive and infection, 2=dead (whether or not preceded by infection). For the actual study we may use a partial proportional odds model to allow for a different treatment effect for the union of infection and death as compared to the effect for death alone. A prior distribution can be specified for the ratio of the two odds ratios (for infection + death and for death) such that there is a 0.95 chance that the ratio of odds ratios is between \(\frac{1}{3}\) and \(3\). A Bayesian power simulation with that amount of borrowing between the treatment effect for the two outcomes is presented at the end of this section.

Use the Weibull model described above to estimate 90-day mortality.

rnd <- function(x) round(x, 3)
S   <- function(t) exp(- 0.9357 * t ^ 0.5224)
m90 <- 1. - S(90 / 365.25)
m30 <- 1. - S(30 / 365.25)

The estimated mortality by 90d is 0.362. The estimated probability of infection for 90-day survivors is 0.2, so the ordinal outcome is expected to have relative frequencies listed in the first line below.

# Hmisc posamsize and popower function cell probabilities are averaged
# over the two groups, so need to go up by the square root of the odds 
# ratio to estimate these averages

phigh <- c(1. - 0.2 - m90, 0.2, m90)
or    <- 0.65
pmid  <- pomodm(p=phigh, odds.ratio=sqrt(or))
plow  <- pomodm(p=pmid, odds.ratio=sqrt(or))
phigh <- rnd(phigh)
pmid  <- rnd(pmid)
plow  <- rnd(plow)
n <- round(posamsize(pmid, odds.ratio=or, power=0.8)$n / 2)
Infpow <- round(popower(pmid, odds.ratio=or, n=2 * Nprime)$power, 3)
[1] 0.844
Treatment Y=0 Y=1 Y=2
Standard Tx 0.438 0.2 0.362
New Tx 0.545 0.185 0.27
Average 0.491 0.195 0.314

The second line in the table are the outcome probabilities when an odds ratio of 0.65 is applied to the standard treatment probabilities. The last line is what is given to the posampsize and popower functions to compute the sample size need for an \(\alpha=0.05\) two-sided Wilcoxon/proportional odds test for differences in the two groups on the 3-level outcome severity with power of 0.8. The needed sample size is 301 per group were a frequentist test to be used. The frequentist power for the sample size (339 per group) that will be used for the primary outcome is 0.844.

For Bayesian power we use a process similar to what was used for the proportional hazards model above, but not quite trusting the quadratic approximation to the log likelihood for the proportional odds model, due to discretenesss of the outcome variable. Before simulating a power curve for all sample sizes using the normal approximation to the data likelihood, we do a full Bayesian simulation of 1000 trials with 300 patients per group in addition to using the approximate normal prior/normal likelihood method.

Here is a function to simulate one study with an ordinal outcome.

options(mc.cores=4, rmsb.backend='cmdstan')

sim1 <- function(n, pcontrol, or, orcut=1, bayes=FALSE) {
  # n: number of pts in each group
  # pcontrol: vector of outcome probabilities for control group
  # or: odds ratio
  # orcut: compute Pr(OR < orcut)

  # Compute outcome probabilities for new treatment
  ptr <- pomodm(p=pcontrol, odds.ratio=or)
  k <- length(pcontrol) - 1
  tx <- c(rep(0, n), rep(1, n))
  y  <- c(sample(0 : k, n, replace=TRUE, prob=pcontrol),
          sample(0 : k, n, replace=TRUE, prob=ptr) )
  f <- lrm(y ~ tx)
  chi <- f$stats['Model L.R.']
  theta.hat <- coef(f)['tx']
  s2        <- vcov(f)['tx', 'tx']
  tau       <- 1.9989
  omega     <- 1.0 / s2
  v         <- 1.0 / (tau + omega)
  mu        <- omega * theta.hat / (tau + omega)
  if(! all(c(length(mu), length(v), length(chi))) == 1) stop("E")
  postprob <- NULL
  if(bayes) {
    # Set prior on the treatment contrast
    pcon <- list(sd=0.7073, c1=list(tx=1), c2=list(tx=0),
                 contrast=expression(c1 - c2) )
    f <- blrm(y ~ tx, pcontrast=pcon,
              loo=FALSE, method='sampling', iter=1000)
    dr <- f$draws[, 'tx']
    postprob <- mean(dr < log(orcut))  # Pr(beta<log(orcut)) = Pr(OR<orcut)
  c(mu   = as.vector(mu),
    v    = as.vector(v),
    pval = as.vector(1.0 - pchisq(chi, 1)),
    postprob = postprob)

# Define a function that can be run on split simulations on
# different cores

g <- function() {
  nsim <- 1000
  # Function to do simultions on one core
  run1 <- function(reps, showprogress, core) {
    R <- matrix(NA, nrow=reps, ncol=4,
                dimnames=list(NULL, c('mu', 'v', 'pval', 'postprob')))
    for(i in 1 : reps) {
      showprogress(i, reps, core)
      R[i, ] <- sim1(300, phigh, 0.65, bayes=TRUE)
  # Run cores in parallel and combine results
  # 3 cores x 4 MCMC chains = 12 cores = machine maximum
  runParallel(run1, reps=nsim, seed=1, along=1, cores=3)
R <- runifChanged(g, file='k3.rds')   # 15 min.

Compute frequentist power and exact (to within simulation error) and approximate Bayesian powper.

u <- as.data.table(R)
pp <- function(mu, v) pnorm((log(1.0) - mu) / sqrt(v))
u[, approxbayes := pp(mu, v)]
cnt <- c(1:5, 10, 25, 50)
ggplot(u, aes(x=approxbayes, y=postprob)) +
  stat_binhex(aes(fill=cut2(after_stat(count), cnt)), bins=75) +
  geom_abline(alpha=0.3) +
  guides(fill=guide_legend(title='Frequency')) +
  xlab('Approximate Posterior Probability OR < 1') +
  ylab('Exact Posterior Probability')

h <- function(x) c(Power = round(mean(x), 3))
u[, rbind('Frequentist'          = h(pval < 0.05),
          'Bayesian'             = h(postprob > 0.975),
          'Approximate Bayesian' = h(approxbayes > 0.975))]
Frequentist          0.795
Bayesian             0.782
Approximate Bayesian 0.780

The posterior distribution is extremely well approximated by a normal distribution at a sample size that matters. Bayesian power is just slightly below frequentist power because frequentist test statistics do not place limits on the treatment effect, e.g., allow the odds ratio to be unrealistically far from 1.0.

Use the approximation for proportional odds model Bayesian power going forward. Compute power for the primary endpoint’s sample size.

R  <- expand.grid(n = Nprime, rep=1:1000)
g <- function() {
  R[, as.list(sim1(n, phigh, or, bayes=FALSE)), by=.(n, rep)]
S <- runifChanged(g, R, or, sim1, file='k3n.rds')
S[, pphi := 1 * (pp(mu, v) > 0.975), by=.(n, rep)]
Infpower <- S[, mean(pphi)]
[1] 0.811

The Bayesian power to detect an OR of 0.65 at a sample size of 339 per group is 0.811.

Bayesian Power for Different Treatment Effect for Infection and Death

The proportional odds (PO) model can provide a correct assessment of which treatment is better overall even when PO is not satisfied. But with non-PO the power will not be optimal, and estimates of outcome probabilities for individual outcomes can be off. To estimate Bayesian power when PO is not assumed, allow the treatment effect (OR) for I|D (infection unioned (“or”d) with death) to differ from the OR for D. In the \(Y=0,1,2\) notation we have one OR for \(Y\geq 1\) and a separate OR for \(Y=2\). This is the partial PO model of Peterson & Harrell (1990). The amount of borrowing for the I|D treatment effect and the D effect is specified by a prior for the ratio of the I|D and D ORs. We assume the log of the ratio of the ORs is normal with mean zero and variance such that the probability that the ratio of ratios is between \(\frac{1}{3}\) and \(3\) is 0.95. The standard deviation accomplishing this is \(\frac{\log(3)}{1.96} = 0.5605\).

Also simulate the power of the treatment comparison on I|D when there is no borrowing of treatment effect information across I and D, i.e., when the prior for the non-PO effect (here the special effect of treatment on D) is non-informative. This partial PO model assumes nothing about how the treatment effect on I|D compares to the effect on D.

The sample size is 302 per group to yield Bayesian power of 0.8 against OR=0.65 in PO holds. For this sample size, simulations showed the loss of power by not assuming PO was minimal when the data satisfied PO. So the simulations are run with \(n=75\) per group to better understand power implications for varying amounts of information borrowing.

The R rmsb package’s blrm function has the user specify priors for departures from PO through contrasts and through a “departure from PO” function (the cppo argument). The contrast for the differential treatment effect over outcome components in the information-borrowing case is the same one that we apply to the main treatment effect except for using a standard deviation involving \(\log(3)\) instead of \(\log(4)\).

When the statistical analysis plan for the study is detailed, more simulations will be run that use a non-PO data generating mechanism, but for now data will be simulated the same as earlier, assuming PO for treatment.

sim1npo <- function(n, pcontrol, or, orcut=1, peffect=FALSE) {
  # n: number of pts in each group
  # pcontrol: vector of outcome probabilities for control group
  # or: odds ratio to detect for I|D
  # orcut: Compute Pr(OR < orcut)
  # Set peffect=TRUE to debug

  # Compute outcome probabilities for new treatment
  ptr <- pomodm(p=pcontrol, odds.ratio=or)
  k <- length(pcontrol) - 1
  tx <- c(rep(0, n), rep(1, n))
  y  <- c(sample(0 : k, n, replace=TRUE, prob=pcontrol),
          sample(0 : k, n, replace=TRUE, prob=ptr) )
  # Set prior on the treatment contrast
  pcon <- list(sd=0.7073, c1=list(tx=1), c2=list(tx=0),
               contrast=expression(c1 - c2) )
  # PO model first
  f <- blrm(y ~ tx, pcontrast=pcon,
            loo=FALSE, method='sampling', iter=1000)
  g <- blrm(y ~ tx, 
            y ~ tx, cppo=function(y) y == 2,
            loo=FALSE, method='sampling', iter=1000)

  # Set prior on the "treatment by Y interaction" contrast
  npcon    <- pcon
  npcon$sd <- 0.5605
  h <- blrm(y ~ tx, 
            y ~ tx, cppo=function(y) y == 2,
            pcontrast=pcon, npcontrast=npcon,
            loo=FALSE, method='sampling', iter=1000)
  if(peffect) {
    print(rbind(po=c(coef(f), NA), flat=coef(g), skeptical=coef(h)))
  df <- f$draws[, 'tx']     # log OR for I|D
  dg <- g$draws[, 'tx']     # = log OR for D for PO
  dh <- h$draws[, 'tx']
  # Pr(beta < 0) = Pr(OR < 1)
  lor <- log(orcut)
  c(po = mean(df < lor), flat = mean(dg < lor), skeptical = mean(dh < lor))

g <- function() {
  nsim <- 1000
  # Function to do simultions on one core
  run1 <- function(reps, showprogress, core) {
    R <- matrix(NA, nrow=reps, ncol=3,
                dimnames=list(NULL, c('po', 'flat', 'skeptical')))
    for(i in 1 : reps) {
      showprogress(i, reps, core)
      R[i,] <- sim1npo(N, phigh, 0.65)
  # Run cores in parallel and combine results
  # 3 cores x 4 MCMC chains = 12 cores = machine maximum
  runParallel(run1, reps=nsim, seed=11, cores=3, along=1)
N <- 75
R  <- runifChanged(g, sim1, N, file='k3npo.rds')
bp <- apply(R, 2, function(x) mean(x >= 0.975))
       po      flat skeptical 
    0.228     0.203     0.208 

For \(n=75\) per group, the Bayesian power is 0.228 when completely borrowing information from D to help inform the treatment effect estimate for D|I. Allowing for arbitrarily different effects (non-PO) reduces the power to 0.203, and assuming some similarity of effects yields a power of 0.208.

9.6.4 Bayesian Power for 6-Level Ordinal Non-Inferiority Outcome

Suppose that the 6-level ECOG performance status assessed at 30 days is another secondary endpoint, and is considered to be a non-inferiority endpoint. Assessment of non-inferiority is more direct with a Bayesian approach and doesn’t need to be as conservative as a frequentist analysis, as we are assessing evidence for a broader range of treatment effects that includes slight harm in addition to the wide range of benefits.

The outcome will be analyzed using a Bayesian proportional odds model with the same prior on the log odds ratio as has been used above. The 6 cell probabilities assumed for the control group appear in the top line in the table below.

The estimated mortality by 30d is 0.224.

# Start with relative probabilities of Y=0-4 and make them
# sum to one minus 30d mortality
phigh <- c(42, 41, 10, 1, 1) / 100
phigh <- c(phigh / sum(phigh) * (1. - m30), m30)
or    <- 0.65
pmid  <- pomodm(p=phigh, odds.ratio=sqrt(or))
plow  <- pomodm(p=pmid, odds.ratio=sqrt(or))
n     <- round(posamsize(pmid, odds.ratio=or, power=0.8)$n / 2)
Perfpowf <- popower(pmid, odds.ratio=or, n = Nprime * 2)$power
g <- function(x) {
  names(x) <- paste0('Y=', (1 : length(x)) - 1)
  round(x, 3)
rbind('Standard treatment' = g(phigh),
      'New treatment'      = g(plow),
      'Average'            = g(pmid) )
                     Y=0   Y=1   Y=2   Y=3   Y=4   Y=5
Standard treatment 0.343 0.335 0.082 0.008 0.008 0.224
New treatment      0.446 0.319 0.065 0.006 0.006 0.158
Average            0.393 0.330 0.074 0.007 0.007 0.189

The second line in the table are the outcome probabilities when an odds ratio of 0.65 is applied to the standard treatment probabilities. The last line is what is given to the posampsize function to compute the sample size need for an \(\alpha=0.05\) two-sided Wilcoxon / proportional odds test for differences in the two groups on the 6-level outcome severity with power of 0.8. This is also used for popower. The needed sample size is 283 per group were a frequentist test to be used. This is for a superiority design. The frequentist power for 339 per group is 0.864524.

For performance status a non-inferiority design is used, so Bayesian power will differ more from frequentist superiority power, as the Bayesian posterior probability of interest is \(\Pr(\text{OR} < 1.2)\). Simulate Bayesian power using the normal approximation.

g <- function() {
  or <- 0.65
  R  <- expand.grid(n = Nprime, rep=1:1000)
  R[, as.list(sim1(n, phigh, or, orcut=1.2, bayes=FALSE)), by=.(n, rep)]
S <- runifChanged(g, sim1, file='k6.rds')   # 16s
pp <- function(mu, v) pnorm((log(1.2) - mu) / sqrt(v))
S[, pphi := 1 * (pp(mu, v) > 0.975), by=.(n, rep)]
Perfpowb <- S[, mean(pphi)]

Bayesian power to detect an OR of 0.65 when seeking evidence for OR < 1.2 with 339 patients per group is 0.987.

9.7 Sample Size Needed For Prior Insensitivity

One way to choose a target sample size is to solve for \(n\) that controls the maximum absolute difference in posterior probabilities of interest, over the spectrum of possible-to-observe data. Let

  • \(n\) = sample size in each of two treatment groups
  • \(\Delta\) = unknown treatment effect expressed as a difference in means \(\mu_{1} - \mu_{2}\)
  • the data have a normal distribution with variance 1.0
  • \(\hat{\Delta}\) = observed difference in sample means \(\overline{Y}_{1} - \overline{Y}_{2}\)
  • \(V(\hat{\Delta}) = \frac{2}{n}\)
  • Frequentist test statistic: \(z = \frac{\hat{\Delta}}{\sqrt{\frac{2}{n}}}\)
  • \(\delta\) assumed to be the difference not to miss, for frequentist power
  • One-sided frequentist power for two-sided \(\alpha\) with \(z\) critical value \(z_{\alpha}\): \(\Pr(z > z_{\alpha}) = 1 - \Phi(z_{\alpha} - \frac{\delta}{\sqrt{\frac{2}{n}}})\)

Power curves are shown below for \(\alpha=0.05\).

d <- expand.grid(n=c(1:19, seq(20, 200, by=5)),
                 delta = seq(0, 2, by=0.25))
ggplot(d, aes(x=n, y=1 - pnorm(qnorm(0.975) - delta / sqrt(2 / n)), color=factor(delta))) +
  geom_line() + guides(color=guide_legend(title=expression(delta))) +
  scale_x_continuous(minor_breaks=seq(0, 200, by=10)) +
  scale_y_continuous(minor_breaks=seq(0,   1, by=0.05)) +
  xlab('n') + ylab(expression(paste('Power at ', alpha==0.05)))

  • Assume skeptical prior for \(\Delta\) that is normal with mean 0 and variance \(\frac{1}{\tau}\)
  • Posterior distribution for \(\Delta | \hat{\Delta}\) is normal
    • variance = \(\frac{1}{\tau + \frac{n}{2}}\)
    • mean = \(\frac{n\hat{\Delta}}{n + 2\tau}\)
pp <- function(tau, n, delta, Deltahat)
  round(1 - pnorm((delta - n * Deltahat / (n + 2 * tau)) * sqrt(tau + n / 2)), 4)
# Compute prior SD such that P(Delta > delta) = prob
psd <- function(delta, prob) round(delta / qnorm(1 - prob), 4)

  • Example: Probability of misleading evidence for efficacy
    • agreed-upon prior was such that \(\Pr(\Delta > 0.5) = 0.05 \rightarrow\) SD of the prior is 0.304
    • someone disagrees, and has their own posterior probability of efficacy based on their prior, which was chosen so that \(\Pr(\Delta > 0.5) = 0.005 \rightarrow\) SD of their prior is 0.1941
    • suppose that \(\hat{\Delta}\) at \(n=100\) is 0.3
    • \(\rightarrow\) posterior probability of \(\Delta > 0\) using pre-specified prior is 0.9801
    • \(\rightarrow\) posterior probability of \(\Delta > 0\) using the critic’s prior is 0.9783
      • there is substantial evidence for efficacy in the right direction
      • probability of being wrong in acting as if the treatment works is 0.0199 to those who designed the study, and is 0.0217 to the critic
    • the two will disagree more about evidence for a large effect, e.g. \(\Pr(\Delta > 0.2)\)
    • \(\Pr(\Delta > 0.2)\) using pre-specified prior: 0.724
    • \(\Pr(\Delta > 0.2)\) for the critic: 0.7035
    • Probabilities of being wrong in acting as if the treatment effect is large: 0.276, 0.2965
  • Sample size needed to overcome effect of prior depends on the narrowness of the prior (\(\tau; \sigma\))
    • for posterior mean and variance to be within a ratio \(r\) of those from a flat prior \(n\) must be \(> \frac{2\tau r}{1 - r}\)
    • within 5% for \(n > 38\tau\)
    • within 2% for \(n > 98\tau\)
    • within 1% for \(n > 198\tau\)
    • typical value of \(\sigma = \frac{1}{\tau} = 0.5\) so \(n > 196\) to globally be only 2% sensitive to the prior
  • Posterior \(\Pr(\Delta > \delta | \hat{\Delta}) = 1 - \Phi((\delta - \frac{n\hat{\Delta}}{n + 2\tau})\sqrt{\tau + \frac{n}{2}})\)
  • Choose different values of the prior SD to examine sensitivity of cumulative posterior distribution to prior
  • Below see that the posterior probability of any treatment benefit (when \(\delta=0\)) is not sensitive to the prior
  • Probability of benefit \(> \delta\) for large \(\delta\) is sensitive to the prior
    • priors having large probability of large \(\Delta\) lead to higher posterior probabilities of large \(\Delta > \delta\) when \(\delta\) is large
  • For each \(n, \delta\), and SDs compute the absolute difference in posterior probabilities that \(\Delta > \delta\)
  • First look at the result when assessing evidence for any efficacy (\(\delta=0\)) when the most extreme priors (\(\sigma=\frac{1}{8}\) vs. \(\sigma=8\)) are compared
  • Don’t consider conditions when both posterior probabilities are in [0.1, 0.9] where disagreements are probably inconsequential
sigmas <- c(1/8, 8)
w <- expand.grid(n        = c(1:30, seq(35, 200, by=5)),
                 Deltahat = seq(-1, 5, by=0.025))
z <- lapply(1 : nrow(w),
            function(i) w[i, .(n, Deltahat,
                               p1 = pp(8,   n, 0, Deltahat),
                               p2 = pp(1/8, n, 0, Deltahat) )] )
u <- rbindlist(z)
# Ignore conditions where both posterior probabilities are in [0.1, 0.9]
# Find maximum abs difference over all Deltahat
m <- u[, .(mdif = max(abs(p1 - p2) * (p1 < 0.1 | p2 < 0.1 | p1 > 0.9 | p2 > 0.9))), by=n]
ggplot(m, aes(x=n, y=mdif)) + geom_line() +
  scale_x_continuous(minor_breaks=seq(0, 200, by=10)) +
  scale_y_continuous(minor_breaks=seq(0, 0.2, by=0.01)) +
  ylab(expression(paste('Maximum |Difference| in ', Pr(Delta > delta ~ "|" ~ hat(Delta))))) 

  • By \(n=170\) the largest absolute difference in posterior probabilities of any efficacy is 0.01

  • The difference is 0.02 by \(n=80\)

  • Sample sizes needed for frequentist power of 0.9 are \(n=38\) for \(\Delta=0.5\) and \(n=85\) for \(\Delta=0.25\)

  • Now consider the full picture where evidence for efficacy \(> \delta\) where \(\delta \neq 0\) is included

  • Vary \(n, \delta, \hat{\Delta}\) and the two prior SDs

  • Don’t show points where the difference < 0.0025 or when both posterior probabilities are in [0.1, 0.9]

  • In the graph one of the prior SDs is going across, and the other is vertical; diagonals where the two SDs are equal are not needed

# For a given n and delta two values of SDs compute the absolute difference
# of the two posterior probabilities that Delta > delta when one of the posterior
# probabilities was < 0.1 or > 0.9 and when the difference exceeds 0.0025
adif <- function(n, delta, s1, s2) {
  tau1 <- 1 / s1
  tau2 <- 1 / s2
  Deltahat <- seq(-1, 5, by=0.05)
  p1 <- pp(tau1, n, delta, Deltahat)
  p2 <- pp(tau2, n, delta, Deltahat)
  i  <- abs(p1 - p2) >= 0.0025 & (pmin(p1, p2) < 0.1 | pmax(p1, p2) > 0.9)
  data.table(n, delta, s1, s2, Deltahat, ad=abs(p1 - p2))[i]
taus <- c(1/8, 1/4, 1/2, 1, 2, 4, 8)
sigmas <- 1 / taus
dels   <- seq(-1, 1, by=0.5)
w <- expand.grid(n     = c(10, 15, seq(20, 400, by=10)),
                 delta = dels,
                 s1    = sigmas,
                 s2    = sigmas)
w <- w[s2 > s1]
z <- lapply(1 : nrow(w), function(i) w[i, adif(n, delta, s1, s2)])
u <- rbindlist(z)
u[, adc := cut2(ad, c(0, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.1, 0.15, 0.2, 0.4))]
g <- vector('list', length(dels))
names(g) <- paste0('δ=', dels)
for(del in dels)
  g[[paste0('δ=', del)]] <- ggplot(u[delta==del], aes(x=n, y=Deltahat, color=adc)) + geom_point(size=0.3) +
    facet_grid(s1 ~ s2) + guides(color=guide_legend(title='|Difference|')) +
    theme(legend.position='bottom') +
    ylab(expression(hat(Delta))) +
    labs(caption=bquote(delta == .(del)))
maketabs(g, initblank=TRUE)
  • “Mouth” from the left of each panel corresponds to regions where both posterior probabilities are in [0.1, 0.9]
  • Most extremely different priors examined had SD=0.25 (skeptical of large \(\Delta\)) vs. SD=8 (large \(\Delta\) are likely)
  • For \(\delta=0\) differences vanish for \(n > 200\)
  • For \(\delta=-1\) differences are greatest when observed differences in means \(\hat{\Delta} = -1\)
  • For \(\delta=1\) differences are greatest when \(\hat{\Delta} = 1\)
  • Required \(n=400\) for influence of prior to almost vanish everywhere when the most different priors are paired
  • More realistic to compare somewhat skeptical SD=1 with skeptical SD=0.25
  • Larger discrepancy when \(\delta=-1\) and \(\hat{\Delta}=-1\)
  • \(n > 300\) makes for small discrepancy

Another display of the same conditions except for fewer values of \(\hat{\Delta}\).

w <- expand.grid(n        = c(10, 15, seq(20, 400, by=10)),
                 delta    = seq(-1, 1, by=0.5),
                 s        = sigmas,
                 Deltahat = c(-1, -0.5, 0, 0.25, 0.5, 0.75, 1, 1.5))
z <- lapply(1 : nrow(w), function(i) w[i, .(n, delta, s, Deltahat, pp=pp(1/s, n, delta, Deltahat))])
u <- rbindlist(z)
ggplot(u, aes(x=n, y=pp, color=factor(s))) + geom_line() +
  facet_grid(paste0('delta==', delta) ~ paste('hat(Delta)==', Deltahat), label='label_parsed') +
  guides(color=guide_legend(title=expression(paste('Prior ', sigma)))) +
  ylab(expression(Pr(Delta > delta ~ "|" ~ hat(Delta)))) +

See Rigat (2023) for related ideas.

9.8 Sequential Monitoring and Futility Analysis

I want to make it quite clear that I have no objection with flexible designs where they help us avoid pursuing dead-ends. The analogous technique from the sequential design world is stopping for futility. In my opinion of all the techniques of sequential analysis, this is the most useful. — Senn (2013)

  • Compute PP as often as desired, even continuously
  • Formal Bayesian approach can also consider having adequate safety data
    • Stop when PP > 0.95 for any efficacy and PP > 0.95 for safety event hazard ratio ≤ 1.2
  • Spiegelhalter et al. (1993) argue for use of ‘range of equivalence’ in monitoring
  • Bayesian design is very simple if no cap on n
  • But often reach a point where clinically important efficacy unlikely to be demonstrated within a reasonable n
  • Futility is more easily defined with Bayes. 3 possible choices:
    • Stop when P(efficacy < ε) is > 0.9
    • Use posterior predictive distribution at planned max sample size (Berry (2006))
    • Pre-specify max n and continue until that n if trial not stopped early for efficacy or harm (more expensive)
  • Insightful approximation of Spiegelhalter et al. (1993)
    • use Z-statistic at an interim look
    • assume fraction f of subjects randomized to date
    • based only on observables
pf <- function(z, f) pnorm(z/sqrt(1 - f) - 1.96 * sqrt(f) / sqrt(1 - f))
zs <- seq(-1, 3, length=200)
fs <- c(.1, .25, .4, .75, .9)
d <- expand.grid(Z=zs, f=fs)
f <- d$f
d$p <- with(d, pf(Z, f))
d$f <- NULL
p <- split(d, f)
labcurve(p, pl=TRUE, ylab='Predictive Probability')

Predictive probability of the final 0.95 credible interval excluding zero, and the treatment effect being in the right direction, given the fraction f of study completed and the current test statistic Z when the prior is flat. f values are written beside curves.
  • Example: to have reasonable hope of demonstrating efficacy if an interim Z value is 1.0, one must be less than 1/10th of the way through the trial

  • For example simulations of futility see this

The published report of any study attempts to answer the crucial question: What does this trial add to our knowledge? The strength of the Bayesian approach is that it allows one to express the answer formally. It therefore provides a rational framework for dealing with many crucial issues in trial design, monitoring and reporting. In particular, by making explicitly different prior opinions about efficacy, and differing demands on new therapies, it may shed light on the varying attitudes to balancing the clinical duty to provide the best treatment for individual patients, against the desire to provide sufficient evidence to convince the general body of clinical practice. — Spiegelhalter et al. (1993)

9.9 Example Operating Characteristics

Even with a disadvantageous mismatch between the prior used in the analysis and the prior used to generate the universe of efficacies, the design was able to stop early for evidence of non-trivial efficacy in 0.401 of 10,000 trials over a skeptical distribution of true efficacies, and was right 0.91 of the time when doing so. The design was right 0.98 of the time in terms of there being truly some efficacy when stopping for more than trivial efficacy. Stopping early for evidence of less than trivial efficacy was possible in 0.518 of the trials and was the correct decision 0.97 of the time. One quarter of such trials stopped with fewer than 14 participants studied and half had a sample size less than 32. Overall the average sample size at the time of stopping was 138 for the most skeptical prior, which is just over ½ what a traditional frequentist design would require (234) just to have power to detect the MCID, still not providing evidence for non-trivial efficacy, inefficacy, or similarity. With the Bayesian sequential design only 0.024 of the trials ended without a conclusion for the most skeptical prior.

For the Bayesian sequential design at the start of this chapter, simulate the operating characteristics for the following setting.

  • Two-arm parallel-group RCT with equal sample sizes per group
  • Response variable is normal with standard deviation 1.0
  • Estimand of interest is \(\Delta\) = the unknown difference in means
  • MCID is \(\delta=0.3\)
  • Minimum clinically detectable difference is \(\gamma=0.1\)
  • Similarity interval is \(\Delta \in [-0.15, 0.15]\)
  • Maximum affordable sample size is 750 per treatment group
  • Required minimum precision in estimating \(\Delta\): \(\pm 0.4\) (half-width of 0.95 credible interval)
    • \(\rightarrow\) minimum sample size \(N_{p}\) before stopping for efficacy
    • Allow stopping for inefficacy at any time, even if the treatment effect cannot be estimated after stopping, i.e., one only knows there is a high probability that the effect \(< \gamma\).
  • The first analysis takes place after the first participant has been enrolled and followed, then after each successive new participant’s data are complete. The first look for efficacy is delayed until the \(N_{p}^\text{th}\) participant has been followed so that the magnitude of the treatment effect has some hope of being estimated.
  • Prior for \(\Delta\) is normal with mean 0 and SD \(\sigma\) such that \(\Pr(\Delta > 0.75) = \kappa\) where \(\kappa\) is varied over \({0.2, 0.1, 0.05, 0.025}\)
    • This corresponds to \(\sigma=\) 0.8911, 0.5852, 0.456, 0.3827
    • These are skeptical priors, chosen because the treatment is assumed to be an incremental rather than a curative therapy
With these similarity limits a large number of studies were stopped early for similarity. In one test with a narrower similarity interval, 0.035 of simulated studies were stopped early for similarity.When using an ultra-high information continuous response variable as in this example, with the between-subject standard deviation known to be 1.0, the data having a normal distribution, and not delaying the first look, it is possible to make a decision with any one participant if she has a very impressive response. This could never happen with a binary endpoint or with unknown SD.

The assumed data model leads to a frequentist point estimate of \(\Delta\) = \(\hat{\Delta}\) = difference in two means, with variance \(\frac{2}{n}\), where \(n\) is the per-group sample size. The one-sided frequentist power for a two-tailed test with \(\alpha=0.05\) is \(1 - \Phi(1.96 - \frac{0.3}{\sqrt{\frac{2}{n}}})\), and setting this to 0.9 requires \(n=2[\frac{1.96 + 1.282}{0.3}]^2 = 234\). 1.282 is \(\Phi^{-1}(0.9)\).

To arrive at \(N_{p}\), the minimum sample size needed to achieve non-embarrassing precision in estimating \(\Delta\), keep in mind that skeptical priors for \(\Delta\) push the posterior distribution for \(\Delta\) towards zero, resulting in narrower posterior distributions. To be conservative in estimating \(N_{p}\) use a flat prior. The 0.90 credible interval is \(\hat{\Delta} \pm 1.645\sqrt{\frac{2}{n}}\). Setting the half-length of the interval to \(\delta=0.3\) results in \(n = N_{p} = 2(\frac{1.645}{0.3})^2 = 60\) per group.

Were a margin of error of 0.2 been required, \(N_{p}=135\), i.e., the first look for possible stoppage for efficacy would be at the \(135^\text{th}\) participant. Setting the required precision to 0.085 would result in no early stopping for efficacy, which is a choice that some trial designers may make, but it requires massive justification for the choice of the maximum sample size.

At a given interim look with a cumulative difference in means of \(\hat{\Delta}\) the posterior distribution of \(\Delta | \hat{\Delta}\) is normal with variance \(\frac{1}{\tau + \frac{n}{2}}\) and mean \(\frac{n\hat{\Delta}}{n + 2\tau}\) where \(\tau\) is the prior precision, \(\frac{1}{\sigma^2}\). Here is a function to compute the posterior probability that \(\Delta > \delta\) given the current difference in means \(\hat{\Delta}\).

pp <- function(delta, Deltahat, n, kappa) {
  sigma <- psd(0.75, kappa)
  tau   <- 1. / sigma ^ 2
  v     <- 1. / (tau + n / 2.)
  m     <- n * Deltahat / (n + 2. * tau)
  1. - pnorm(delta, m, sqrt(v))

A simple Bayesian power calculation for a single analysis after \(n\) participants are enrolled in each treatment group, and considering only evidence for any efficacy, sets the probability at 0.9 that the posterior probability of \(\Delta > 0\) exceeds 0.95 when \(\Delta=\delta\). The posterior probability PP that \(\Delta > 0\) is \(\Phi(\frac{n\hat{\Delta}/\sqrt{2}}{\sqrt{n + 2\tau}})\). Making use of \(\hat{\Delta} \sim n(\delta, \frac{2}{n})\), the probability that PP > 0.95 is \(1 - \Phi(\frac{za - \delta}{\sqrt{\frac{2}{n}}})\) where \(z = \Phi^{-1}(0.95)\) and \(a=\frac{\sqrt{2}\sqrt{n + 2\tau}}{n}\).

# Special case when gamma=0, for checking
bpow <- function(n, delta=0.3, kappa=0.05) {
  tau <- 1. / psd(0.75, kappa) ^ 2
  z <- qnorm(0.95)
  a <- sqrt(2) * sqrt(n + 2 * tau) / n
  1. - pnorm((z * a - delta) / sqrt(2 / n))
bpow <- function(n, delta=0.3, gamma=0, kappa=0.05, ppcut=0.95) {
  tau <- 1. / psd(0.75, kappa) ^ 2
  z <- qnorm(ppcut)
  a <- n + 2. * tau
  1. - pnorm(sqrt(n / 2) * ((a / n) * (gamma + z * sqrt(2 / a) ) - delta))
findx <- function(x, y, ycrit, dec=0) round(approx(y, x, xout=ycrit)$y, dec)
R$bayesN <- round(findx(10:500, bpow(10:500), 0.9))
[1] 0.9006247

Setting the Bayesian power to 0.9 results in \(196\) when \(\kappa=0.05\). Contrast this with the frequentist sample size of 234.

To understand the stringency of adding the requirement that efficacy is non-trivial to the requirement of having a very high probability that any efficacy exists, consider a sample size (45) and current mean difference (0.3) where the probability of any efficacy is 0.9. Compute the probability of non-trivial efficacy with that mean difference. The prior probability of a very large (0.75) or larger treatment effect is taken as \(\kappa=0.05\).

pp(  0,  0.3, 45, 0.05)
[1] 0.9017632
pp(0.1,  0.3, 45, 0.05)
[1] 0.7790777
pp(  0, 0.36, 45, 0.05)
[1] 0.9394288
pp(0.1, 0.36, 45, 0.05)
[1] 0.8478874

At the point at which there is over 0.9 chance of any efficacy in the right direction, the probability of non-trivial efficacy is 0.78. To achieve a probability of 0.85 would require an observed mean difference of 0.36 which raises the probability of any efficacy to 0.94.

What if instead we wanted 0.9 Bayesian power to achieve a PP > 0.85 that \(\Delta > \gamma\), the threshold for a detectable treatment effect? Let’s compute over a range of \(n\) the two relevant probabilities: of getting \(\Pr(\Delta > 0) > 0.95\) and of getting \(\Pr(\Delta > 0.10) >0.85\), and also compute Bayesian power for \(\Pr(\Delta > 0.15) > 0.8\).

gammas <- c(0,      .1, .15)
ppcuts <- c(0.95, 0.85, 0.8)
w <- expand.grid(crit=1:3, n=seq(10, 750, by=5))
w[, Criterion := paste0('Pr(efficacy > ', gammas[crit], ') > ', ppcuts[crit])]
w[, p := bpow(n=n, gamma=gammas[crit], ppcut=ppcuts[crit])]
ggplot(w, aes(x=n, y=p, col=Criterion)) + geom_line() +
  scale_y_continuous(breaks=seq(0, 1, by=0.1)) +
  ylab('Bayesian Power') +

# Solve for n to get Bayesian power of 0.9
p <- w[, .(n=findx(n, p, 0.9)), by=.(crit, Criterion)]
navg <- if(length(R$stopn)) 
  R$stopn[conclusion == 'non-trivial efficacy' & kappa==0.001, Mean]

For a fixed sample-size study the necessary \(N\) to have a 0.9 chance that the posterior probability of more than trivial efficacy exceeds 0.85 is 282.

Later in this section the expected sample size to establishment of non-trivial efficacy will be estimated when continuous sequential analysis is allowed. This number is 122 for the most skeptical prior used. The more general expected sample size is that required on the average to reach any conclusion. This number is 138.

9.9.1 Decision Thresholds

Before simulating the trials, let’s compute the decision thresholds, i.e., the least impressive values of the observed difference in means \(\hat{\Delta}\) that would trigger a given conclusion, as a function of the current sample size. Let \(\hat{\Delta}\) be a regular sequence of values for each sample size. For the current purpose don’t delay any efficacy looks, and treat each of the three reasons for stopping independently. For inefficacy, the \(\hat{\Delta}\) threshold is the highest \(\hat{\Delta}\) leading to stopping for inefficacy. For non-trivial efficacy, this is the lowest \(\hat{\Delta}\) leading to stopping for non-trivial effectiveness. For similarity the threshold is an interval with the lower value being the minimum \(\hat{\Delta}\) meeting the similarity criterion and the upper limit is the maximum \(\hat{\Delta}\) meeting the criterion.

Do the calculations for one \(\kappa\).

difs <- seq(-1.5, 1.5, by=0.001)
ns   <- 1 : 750
d <- expand.grid(dif=difs, n=ns)
thresh <- function(Deltahat, n, kappa=0.05) {
  # For an ascendeing Deltahat vector and single value of n
  # Compute posterior probability of any efficacy
  p1 <- pp(0., Deltahat, n, kappa)
  # Compute probability of non-trivial efficacy
  p2 <- pp(0.1, Deltahat, n, kappa)
  # Compute probability of similarity
  p3 <- pp(-0.15, Deltahat, n, kappa) -
        pp( 0.15, Deltahat, n, kappa)

  ineff  <- max(Deltahat[p2 < 0.025])
  eff    <- min(Deltahat[p1 > 0.95 & p2 > 0.85])
  simlow <- min(Deltahat[p3 > 0.95])
  simhi  <- max(Deltahat[p3 > 0.95])
  x <- c(ineff, eff, simlow, simhi)
  x[is.infinite(x)] <- NA
  list(conclusion=c('inefficacy', 'non-trivial efficacy', 'similarity-low', 'similarity-high'),
w <- d[, thresh(dif, n, kappa=0.05), by=n]
R$threshold <- w
w[n %in% c(50, 100, 200, 750)]
        n           conclusion Deltahat
    <int>               <char>    <num>
 1:    50           inefficacy   -0.309
 2:    50 non-trivial efficacy    0.360
 3:    50       similarity-low       NA
 4:    50      similarity-high       NA
 5:   100           inefficacy   -0.181
 6:   100 non-trivial efficacy    0.264
 7:   100       similarity-low       NA
 8:   100      similarity-high       NA
 9:   200           inefficacy   -0.096
10:   200 non-trivial efficacy    0.211
11:   200       similarity-low       NA
12:   200      similarity-high       NA
13:   750           inefficacy   -0.001
14:   750 non-trivial efficacy    0.156
15:   750       similarity-low   -0.066
16:   750      similarity-high    0.066
ggplot(w, aes(x=n, y=Deltahat, col=conclusion)) + geom_line() +
  scale_x_continuous(minor_breaks=seq(0, 675, by=25)) +
  scale_y_continuous(breaks=seq(-1.5, 1.5, by=0.5),
                     minor_breaks=seq(-1.5, 1.25, by=0.1)) +
  ylab(expression(hat(Delta))) +
  labs(title='Decision thresholds in terms of observed differences in means')

sim.min   <- w[conclusion == 'similarity-low' & ! is.na(Deltahat), min(n)]
sim.max   <- w[conclusion == 'similarity-high' & n == 750, Deltahat]
ineff100  <- w[conclusion == 'inefficacy' & n == 100, Deltahat]
ineff200  <- w[conclusion == 'inefficacy' & n == 200, Deltahat]
nte50     <- w[conclusion == 'non-trivial efficacy' & n == 50, Deltahat]
nte200    <- w[conclusion == 'non-trivial efficacy' & n == 200, Deltahat]
nte.min   <- w[conclusion == 'non-trivial efficacy' & n == 750, Deltahat]

A few observations are in order.

  • Similarity can’t be concluded until \(n > 332\)
  • Even then the observed difference needs to be almost exactly zero to make that conclusion
  • At the maximum sample size \(n=750\) similarity is concluded if the absolute value of the observed \(\hat{\Delta}\) is 0.066 or less (recall that similarity was defined as \(\Delta \in [-0.15, 0.15]\))
  • An inefficacy conclusion is possible if \(\hat{\Delta} < -0.181\) when \(n=100\) and if \(\hat{\Delta} < -0.096\) when \(n=200\)
  • A non-trivial efficacy conclusion is possible if \(\hat{\Delta} > 0.36\) when \(n=50\) or when \(\hat{\Delta} > 0.211\) when \(n=200\)
  • When \(\hat{\Delta} = 0.156\) then planned maximum sample size of \(n=750\) is required to be able to conclude non-trivial efficacy. Recall that the MCID is 0.3 and the threshold for trival effect is \(\Delta=0.1\).

Since the observed treatment effect \(\hat{\Delta}\) has variance \(\frac{2}{n}\) with \(n\) participants per group, the \(\hat{\Delta}\) thresholds are easily converted into the \(z\) scale \(\sqrt{\frac{n}{2}}\hat{\Delta}\).

ggplot(w, aes(x=n, y=Deltahat * sqrt(n / 2), col=conclusion)) + geom_line() +
  scale_x_continuous(minor_breaks=seq(0, 675, by=25)) +
  ylab(expression(z)) +
  labs(title='Decision thresholds on the z-scale')

The patterns are different from what is seen with a frequentist approach, as here we see that when \(n > 50\) the critical value for \(z\) increases. That is because we are demanding at least an 0.85 chance that the efficacy is non-trivial (\(\Delta > \gamma\)). The drop in the upper curve from the lowest \(n\) for which a conclusion of non-trivial efficacy is possible (\(n=7\)) to \(n=50\) is due to the influence of the skeptical prior for \(\Delta\) when \(n\) is small. When a flat analysis prior is used for \(\Delta\), the upturn disappears.

9.9.2 Simulation

Now we simulate 10,000 sequential trials for each \(\kappa\) to evaluate the Bayesian operating characteristics. \(\kappa\) and the true unknown treatment effect \(\Delta\) are varied. Studies are allowed to proceed to an upper limit sample size of 750 in each group. Each study is stopped when any stopping criterion is satisfied.

The first function simulates one complete trial, computing the cumulative summary statistic (difference in means) after 1, 2, …, 750 participants in each treatment group.

sim <- function(Delta, kappa, N=750) {
  # Simulate raw data for whole planned trial, 
  # for each treatment
  y1 <- rnorm(N,    0., 1.)
  y2 <- rnorm(N, Delta, 1.)
  j  <- 1 : N
  ybar1 <- cumsum(y1) / j
  ybar2 <- cumsum(y2) / j
  Deltahat <- ybar2 - ybar1
  Deltahat  # cumulative mean difference

The next function executes the stopping rules for one trial, looking sequentially.

sim1 <- function(Delta, kappa, ignore=59) {
  Deltahat <- sim(Delta, kappa)
  n <- 1 : length(Deltahat)
  # Compute posterior probability of any efficacy
  p1 <- pp(0., Deltahat, n, kappa)
  # Compute probability of non-trivial efficacy
  p2 <- pp(0.10, Deltahat, n, kappa)
  # Compute probability of similarity
  p3 <- pp(-0.15, Deltahat, n, kappa) -
        pp( 0.15, Deltahat, n, kappa)
  # For each decision find first sample size at which criterion was satisfied
  k <- n > ignore
  ineff <- which((p2 < 0.025))[1]
  eff   <- which(k & (p1 > 0.95 & p2 > 0.85))[1]
  sim   <- which(k & (p3 > 0.95))[1]
  u <- function(r, n) list(conclusion=r, n=n)
  # If not stopped, trial went to maximum
  All <- c(ineff, eff, sim)
  if(all(is.na(All))) return(u('none', max(n)))
  first <- min(All, na.rm=TRUE)
  if(! is.na(ineff) && ineff == first) return(u('inefficacy', ineff))
  if(! is.na(eff)   && eff   == first) return(u('non-trivial efficacy', eff))
  u('similarity', sim)

Run all 50,000 trials. For each \(\Delta\) and \(\kappa\) simulate one clinical trial. Do this over a very fine grid of \(\Delta\) so that we can later smooth the individual results and do not need multiple trials for any one value of \(\Delta\). \(\Delta\)s are drawn from a mean zero normal distribution with \(\Pr(\Delta > 0.75) = 0.025\).

Deltas <- qnorm(seq(0.0001, 0.9999, length=10000), sd=psd(0.75, 0.025))
kappas <- c(0.2, 0.1, 0.05, 0.025, 0.01, 0.001)
d <- expand.grid(Delta=Deltas, kappa=kappas)
d[, .q(conclusion, n) := sim1(Delta, kappa), by=.I]  # 7 sec.

Here are the analysis \(\Delta\) priors. The distribution of the universe of efficacies used to generate the clinical trials has \(\kappa=0.025\).

x <- seq(-2.5, 2.5, length=300)
w <- expand.grid(Delta=x, kappa=kappas)
w[, sigma := psd(0.75, kappa)]
w[, density := dnorm(Delta, sd=sigma)]
ggplot(w, aes(x=Delta, y=density,
    col=paste0(kappa, ' (', round(sigma, 2), ')'))) + geom_line() +
  xlab(expression(Delta)) + ylab('') +

For each \(\kappa\) show the frequencies of stopping for the various conclusions.

w <- d[, table(conclusion, kappa)]
conclusion             0.001 0.01 0.025 0.05  0.1  0.2
  inefficacy            4984 5007  5066 5101 5159 5184
  non-trivial efficacy  3895 3973  3997 3968 4017 4010
  none                   242  193   227  194  175  172
  similarity             879  827   710  737  649  634
R$propconc <- round(prop.table(w, margin=2), 3)

The prior skepticism \(\kappa\) has greater impact on stopping for similarity and stopping with no conclusion.

Examine the mean and median stopping time (final N) by reason for stopping and \(\kappa\).

qmean <- function(x) {
  qu <- round(quantile(x, (1:3)/4))
  list(Mean   = round(mean(x)),
       Q1     = qu[1],
       Median = qu[2],
       Q3     = qu[3])
w <- d[, qmean(n), by=.(conclusion, kappa)]
print(w, nrows=100)
              conclusion kappa Mean  Q1 Median  Q3
 1:           inefficacy 0.200   62  13     31  80
 2: non-trivial efficacy 0.200  102  60     60  85
 3:           similarity 0.200  423 347    378 465
 4:                 none 0.200  750 750    750 750
 5:           inefficacy 0.100   65  16     35  82
 6: non-trivial efficacy 0.100  103  60     60  91
 7:           similarity 0.100  421 345    381 466
 8:                 none 0.100  750 750    750 750
 9:           inefficacy 0.050   69  18     38  89
10: non-trivial efficacy 0.050  106  60     60  92
11:           similarity 0.050  413 341    367 446
12:                 none 0.050  750 750    750 750
13:           inefficacy 0.025   70  21     41  90
14:           similarity 0.025  415 337    368 462
15: non-trivial efficacy 0.025  107  60     60  96
16:                 none 0.025  750 750    750 750
17:           inefficacy 0.010   73  23     44  95
18:           similarity 0.010  402 331    357 440
19: non-trivial efficacy 0.010  114  60     60 107
20:                 none 0.010  750 750    750 750
21:           inefficacy 0.001   76  27     49 100
22:           similarity 0.001  388 316    338 428
23: non-trivial efficacy 0.001  122  60     64 130
24:                 none 0.001  750 750    750 750
              conclusion kappa Mean  Q1 Median  Q3
R$stopn <- w

Even with the most skeptical prior of the four priors, more than half of the simulated studies which ultimately stopped for efficacy could be stopped with that conclusion after just 60 participants, the minimum required for precision. We also need to examine the stopping times counting the other reasons. Here are the mean and median number of participants when stopping is for any reason (including going to the maximum sample size of 750).

w <- d[, qmean(n), by=.(kappa)]
   kappa Mean Q1 Median  Q3
1: 0.200  113 30     60 110
2: 0.100  116 33     60 117
3: 0.050  122 37     60 126
4: 0.025  125 40     60 129
5: 0.010  130 44     60 142
6: 0.001  138 49     65 160
R$stopany <- w

For planning purposes the expected sample size may be taken as 138. Sequential analysis without the traditional frequentist limitations on the number of looks makes a large difference in trial efficiency while providing evidence in more dimensions.

Now consider the most important aspect of the performance of the design: the primary Bayesian operating characteristic—correctness of decisions about treatment effect. For our simulation study we judge the alignment between the decision about the treatment effect and the (unknown to us and to the sequential analyses but known to the simulation code) true treatment effect that was in play when the data were simulated. So for each prior and each decision that was made, we estimate the distribution of true treatment effects that were in effect when combining all the simulated trials resulting in that decision. This is the distribution of the \(\Delta\) values conditional on stopping (or going to the sample size limit of 750 for the no-conclusion simulations) and on the reason for stopping. Keep in mind that for the simulations \(\Delta\)s were drawn from \(\kappa=0.025\).

Examine distributions of underlying true \(\Delta\) for each prior and each conclusion (reason for stopping).

del <- c(-0.1, 0, 0.1, 0.3)
ggplot(d, aes(x=Delta, color=conclusion)) + geom_density() +
  facet_wrap(~ kappa, ncol=2) +
  scale_x_continuous(breaks=seq(-1, 1, by=0.5),
                     minor_breaks=seq(-1, 1, by=0.1)) +
  geom_vline(xintercept=del, col='blue', alpha=0.2) +
  xlab(expression(Delta)) + xlim(-1, 1) +

sp <- d[, spikecomp(Delta, tresult='segments'), by=.(conclusion, kappa)]
ggplot(sp) + geom_segment(aes(x=x, y=y1, xend=x, yend=y2, alpha=I(0.3))) +
    scale_y_continuous(breaks=NULL, labels=NULL) + ylab('') +
    facet_grid(conclusion ~ kappa) +
    geom_vline(xintercept=del, col='red', alpha=0.2) +

For a conclusion of non-trivial efficacy, the true effect in play was rarely > 0.2. Let’s look at the actual proportions of true \(\Delta\)s in various intervals vs. the conclusion made.

z <- cube(d, .('Δ<0'         = sum(Delta < 0),
              'Δ<0.1'        = sum(Delta < 0.1),
              '-0.15<Δ<0.15' = sum(Delta >= -0.15 & Delta <= 0.15),
              'Δ>0.1'        = sum(Delta > 0.1),
              'Δ>0'          = sum(Delta > 0)),
  by=.q(kappa, conclusion))
    kappa           conclusion   Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1   Δ>0
 1: 0.200           inefficacy  4670  5020         1633   164   514
 2: 0.200 non-trivial efficacy    97   367          663  3643  3913
 3: 0.200           similarity   224   555          607    79   410
 4: 0.200                 none     9    89          147    83   163
 5: 0.100           inefficacy  4659  5012         1622   147   500
 6: 0.100 non-trivial efficacy    98   380          677  3637  3919
 7: 0.100           similarity   235   554          612    95   414
 8: 0.100                 none     8    85          139    90   167
 9: 0.050           inefficacy  4649  4973         1571   128   452
10: 0.050 non-trivial efficacy    72   351          637  3617  3896
11: 0.050           similarity   273   624          699   113   464
12: 0.050                 none     6    83          143   111   188
13: 0.025           inefficacy  4644  4955         1544   111   422
14: 0.025           similarity   270   614          673    96   440
15: 0.025 non-trivial efficacy    81   370          656  3627  3916
16: 0.025                 none     5    92          177   135   222
17: 0.010           inefficacy  4609  4915         1504    92   398
18: 0.010           similarity   321   705          784   122   506
19: 0.010 non-trivial efficacy    61   328          612  3645  3912
20: 0.010                 none     9    83          150   110   184
21: 0.001           inefficacy  4620  4917         1494    67   364
22: 0.001           similarity   329   747          844   132   550
23: 0.001 non-trivial efficacy    47   270          527  3625  3848
24: 0.001                 none     4    97          185   145   238
25: 0.200                 <NA>  5000  6031         3050  3969  5000
26: 0.100                 <NA>  5000  6031         3050  3969  5000
27: 0.050                 <NA>  5000  6031         3050  3969  5000
28: 0.025                 <NA>  5000  6031         3050  3969  5000
29: 0.010                 <NA>  5000  6031         3050  3969  5000
30: 0.001                 <NA>  5000  6031         3050  3969  5000
31:    NA           inefficacy 27851 29792         9368   709  2650
32:    NA non-trivial efficacy   456  2066         3772 21794 23404
33:    NA           similarity  1652  3799         4219   637  2784
34:    NA                 none    41   529          941   674  1162
35:    NA                 <NA> 30000 36186        18300 23814 30000
    kappa           conclusion   Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1   Δ>0
R$ncorrect <- z
bar <- function(x) round(mean(x), 2)
z <- cube(d, .('Δ<0'         = bar(Delta < 0),
              'Δ<0.1'        = bar(Delta < 0.1),
              '-0.15<Δ<0.15' = bar(Delta >= -0.15 & Delta <= 0.15),
              'Δ>0.1'        = bar(Delta > 0.1),
              'Δ>0'          = bar(Delta > 0)),
  by=.q(kappa, conclusion))
R$correct <- z
g <- function(w, kappa) {
  u <- copy(w)
  if(is.na(kappa)) kappa <- 'All'
  vline <- paste(rep('-', 60), collapse='')
  cat('\n', vline, '\nκ=', kappa, '\n', sep='')
  u[, conclusion := ifelse(is.na(conclusion), 'All', conclusion)]
u <- z[, g(.SD, kappa), by=kappa]

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.90  0.97         0.32  0.03 0.10
2: non-trivial efficacy 0.02  0.09         0.17  0.91 0.98
3:           similarity 0.35  0.88         0.96  0.12 0.65
4:                 none 0.05  0.52         0.85  0.48 0.95
5:                  All 0.50  0.60         0.30  0.40 0.50

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.90  0.97         0.31  0.03 0.10
2: non-trivial efficacy 0.02  0.09         0.17  0.91 0.98
3:           similarity 0.36  0.85         0.94  0.15 0.64
4:                 none 0.05  0.49         0.79  0.51 0.95
5:                  All 0.50  0.60         0.30  0.40 0.50

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.91  0.97         0.31  0.03 0.09
2: non-trivial efficacy 0.02  0.09         0.16  0.91 0.98
3:           similarity 0.37  0.85         0.95  0.15 0.63
4:                 none 0.03  0.43         0.74  0.57 0.97
5:                  All 0.50  0.60         0.30  0.40 0.50

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.92  0.98         0.30  0.02 0.08
2:           similarity 0.38  0.86         0.95  0.14 0.62
3: non-trivial efficacy 0.02  0.09         0.16  0.91 0.98
4:                 none 0.02  0.41         0.78  0.59 0.98
5:                  All 0.50  0.60         0.30  0.40 0.50

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.92  0.98         0.30  0.02 0.08
2:           similarity 0.39  0.85         0.95  0.15 0.61
3: non-trivial efficacy 0.02  0.08         0.15  0.92 0.98
4:                 none 0.05  0.43         0.78  0.57 0.95
5:                  All 0.50  0.60         0.30  0.40 0.50

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.93  0.99         0.30  0.01 0.07
2:           similarity 0.37  0.85         0.96  0.15 0.63
3: non-trivial efficacy 0.01  0.07         0.14  0.93 0.99
4:                 none 0.02  0.40         0.76  0.60 0.98
5:                  All 0.50  0.60         0.30  0.40 0.50

             conclusion  Δ<0 Δ<0.1 -0.15<Δ<0.15 Δ>0.1  Δ>0
1:           inefficacy 0.91  0.98         0.31  0.02 0.09
2: non-trivial efficacy 0.02  0.09         0.16  0.91 0.98
3:           similarity 0.37  0.86         0.95  0.14 0.63
4:                 none 0.03  0.44         0.78  0.56 0.97
5:                  All 0.50  0.60         0.30  0.40 0.50
z <- z[kappa == 0.2]
nte.cor   <- z[conclusion == 'non-trivial efficacy', 6]
nte.any   <- z[conclusion == 'non-trivial efficacy', 7]
ineff.any <- z[conclusion == 'inefficacy', 3]
ineff.cor <- z[conclusion == 'inefficacy', 4]
noconc    <- R$propconc['none', '0.001']

The least favorable result is obtained when \(\kappa=0.2\) when the analysis prior is significantly less skeptical than the way the \(\Delta\)s were generated. To highlight two important points from that first part of the table, for trials that were stopped with a conclusion of non-trivial efficacy, 0.91 of such trials were truly generated by \(\Delta > 0.1\). But 0.98 had any efficacy (\(\Delta > 0\)). For trials stopped for inefficacy, 0.9 of them were generated by a harmful treatment. 0.97 of them were generated by a harmful or at most only trivially effective treatment. Inconclusive studies tended to have treatment effect near \(\Delta = 0.1\). Only about 0.024 of the simulated trials resulted in no conclusion when \(\kappa=0.001\), i.e., wasted resources by randomizing 750 participants to each treatment arm only to have no conclusion at all.

To better understand the “no conclusion” results here are the quartiles and 0.1 and 0.9 quantiles of true \(\Delta\)s in effect when trials went to the maximum sample size without stopping.

g <- function(x) {
  z <- quantile(x, c(10, 25, 50, 75, 90)/100)
  as.list(round(z, 2))
d[conclusion == 'none', g(Delta), by=kappa]
   kappa  10%  25%  50%  75%  90%
1: 0.200 0.03 0.07 0.10 0.13 0.16
2: 0.100 0.03 0.06 0.10 0.14 0.17
3: 0.050 0.03 0.07 0.11 0.15 0.19
4: 0.025 0.04 0.07 0.11 0.14 0.17
5: 0.010 0.03 0.07 0.11 0.14 0.18
6: 0.001 0.04 0.07 0.12 0.15 0.18

The median true \(\Delta\) when no conclusion could be reached was very near \(\gamma = 0.1\), the threshold for triviality of treatment effect. So \(\frac{1}{2}\) of the futile trials were generated by trivial efficacy. Very few if any of the trials were generated by a completely ineffective treatment (\(\Delta < 0\)).

Finally, examine the distribution of true \(\Delta\) values as a function of deciles of stopping time, when the conclusion was non-trivial efficacy.

sp <- d[conclusion=='non-trivial efficacy',
        spikecomp(Delta, tresult='segments'), by=cut2(n, g=10)]
ggplot(sp) + geom_segment(aes(x=x, y=y1, xend=x, yend=y2, alpha=I(0.3))) +
    scale_y_continuous(breaks=NULL, labels=NULL) + ylab('') +
    facet_wrap(~ cut2) +
    geom_vline(xintercept=del, col='red', alpha=0.2) +

Many studies stopped at the minimum of N=60 and these were reliable in judging the treatment to have more than trivial efficacy.

9.10 Borrowing Information

  • When the treatment has been well studied in prior randomized clinical trials it is reasonable to use at least some of that information in the prior distribution for the treatment effect in the new study
  • Some discounting is required because there may be doubt about
    • similarity of patients
    • general time trends in outcomes
    • changes in concomitant therapies that could detract from efficacy estimated in patients before such therapies were in use
    • changes in how outcomes are measured including duration of follow-up
  • Some ideal situations for borrowing of information include
    • use of a device on a bone that is not the bone previously studied, or use on skin on another part of the body from that studied
      • studying a device that is a tweak of a device that has been extensively studied previously
    • use of a drug on a different patient population where the difference in patients is not directly related to what the drug is intended to prevent
    • use of adult data about a drug to inform efficacy for treating the same disease in children
  • The posterior probability distribution of the treatment effect from the earlier study/studies becomes part of the prior
  • Important to include results from randomized studies that showed unimpressive treatment effects along with those that were “positive”; no cherry picking allowed
  • Discounting can be done a variety of ways, including power priors and mixtures of priors
  • Mixtures of priors are appealing and more interpretable
  • Mix a prior \(f_1\) that is somewhat skeptical of large treatment effects from the new data alone with an optimistic prior \(f_2\) from previous studies
  • Prior for the new study: \(f(\theta) = (1 - a) f_{1}(\theta) + a f_{2}(\theta)\)
  • \(a\) may be directly interpreted as the applicability of the previous data to the current comparison
  • Applicability can be thought of in two ways (see below); for purpose of Bayesian analysis which of the two you prefer does not matter
  • For either notion of applicability, uncertainties (width of posterior distribution from previous studies) must be carried forward
  • Applicability is a number between 0 and 1 and is either
    • The degree to which you think that relative efficacy of treatment as learned from previous trials applies as the relative efficacy in the current trial or
    • The probability that what is learned about relative efficacy in previous studies fully applies to the new study
  • To anchor the values:
    • \(a=0 \rightarrow\) previous data should be completely ignored
    • \(a=1 \rightarrow\) previous data are completely relevant and the new study is unnecessary
  • To justify a value of \(a\) one should undertake a survey on a reasonable number (say 12-20) of experts including, for drug studies, clinical pharmacologists familiar with absorption and metabolism of the drug in different patient types
  • See this for excellent work the FDA has done related to borrowing adult data for pediatric drug trials

But it is important to note that strong arguments can be made for including the raw data from other studies in a joint analysis with the new trial data, instead of summarizing the other data through a posterior distribution that is converted to a prior and then down-weighted. Down-weighting can come from adding a bias parameter for the historical data when forming the joint new data/old data Bayesian models. Major advantages of this approach include

  • using a standard simple prior (e.g., Gaussian) for discounting
  • making assumptions more explicit
  • explicitly modeling bias
  • getting more accurate analysis, for example by not assuming normality for a point estimate from historical data
  • covariate adjustment

This last point is all-important. If there is a shift in the distribution of important prognostic factors between two data sources, this shift can be handled elegantly and simply using standard covariate adjustment. This is far more effective and efficient than matching or playing with inclusion criteria. Without analyzing the raw data, the effect of covariate shifts is very hard to counter, especially when the outcome model is nonlinear. It is often unclear what part of the covariate distribution (e.g., shift in means vs. a shift in only one of the tails) should be accounted for when using summary and not raw data.

A simple case study in joint raw data modeling for incorporating historical control data into an RCT may be found at fharrell.com/post/hxcontrol. The joint analysis is facilitated by the Bayesian approach, which allows one to specify any number of models for joint analysis. When the models share one or more parameters, interesting things happen. The case study illustrates that

  • When the new RCT has some but not enough control patients, and control information from historical data is incorporated into the analysis, the prior distribution for the bias in the historical data is important. The amount of bias can be estimated.
  • When the new RCT has no controls, so that control information is coming only from the historical data, the prior distribution for the bias in the historical data is incredibly important. The amount of bias cannot be estimated from the data at hand, and comes solely from the bias parameter’s prior distribution. This truly reflects the limitation of not having concurrent controls.

9.10.1 Example Joint Data Model

  • RCT control: \(Y \sim n(\mu_{c}, \sigma)\)
  • RCT active: \(Y \sim n(\mu_{a}, \sigma)\)
  • Historical control: \(Y \sim n(\mu_{c} + b, \sigma)\)
  • Bias \(b \sim n(0, \tau)\)
  • \(\tau\) controls amount of borrowing
  • \(\tau = \infty\): historical data irrelevant
  • \(\tau = 0\): historical data trusted as much as within-study control data
    • Historical data and concurrent controls are estimating the same thing

9.10.2 Extension to Extrapolation

  • Let \(X\) be the extrapolation factor (e.g., age)
  • At the root is treatment \(\times X\) interaction
  • Model: \(Y \sim n(\beta_{0} + \beta_{1}Tx + \beta_{2}X + \beta_{3}\text{Tx} \times X, \sigma)\)
  • No restriction on \(\beta_{3} \rightarrow\) new study stands on its own
  • Skeptical prior on \(\beta_{3} \rightarrow\) borrow information
  • Can also add a nonlinear interaction with a more skeptical prior

fharrell.com/post/ia quantifies how non-overlap of \(X\) distributions relates to the precision of \(X\)-specific treatment effects.