Statistical Design and Analysis Plan for Sequential Parallel-Group RCT for COVID-19

1 Rationale and Background

Dr Harrell has developed detailed supporting documents as part of his 1/4 time appointment to the FDA Center for Drug Evaluation and Research Office of Biostatistics. These documents contain references, examples, detailed explanations and links to other resources. The main resource may be found here with an overview here. Endorsement of these approaches by FDA is not implied. Other background material may be found here.

This trial will adopt a Bayesian framework. Continuous learning from data and computation of probabilities that are directly applicable to decision making in the face of uncertainty are hallmarks of the Bayesian approach. Bayesian sequential designs are the simplest of flexible designs, and continuous learning makes them very efficient by having lower expected sample sizes until sufficient evidence is accrued. Classical null hypothesis testing only provides evidence against the supposition that a treatment has exactly zero effect, and it requires one to deal with complexities if not doing the analysis at a single fixed time. Bayesian posterior probabilities, on the other hand, can be computed at any point in the trial and provide current evidence about all possible questions, such as

  • probability of any beneficial effect
  • probability of a clinically relevant effect
  • probability of harm
  • probability of similarity of outcome of two treatments

The reason that Bayesian inference is more efficient for continuous learning is that it computes probabilities looking ahead—the type of forward-in-time probabilities that are needed for decision making. Such probabilities are probabilities about unknowns based on conditioning on all the current data, without conditioning on unknowns. Importantly, there are no multiplicities to control. This is one of the least well understood aspects of Bayesian vs. frequentist analysis, and it is due to current probabilities superseding probabilities that were computed earlier. An example may help.

Imagine that the military has developed a pattern recognition algorithm to determine whether a distant object is a tank. Initially an image results in a 0.8 probability of the object being a tank. The object moves closer and some fog clears. The probability is now 0.9, and the 0.8 has become completely irrelevant. Contrast this with a frequentist way of thinking: of all the times a tested object wasn’t a tank, the probability of acquiring tank-like image characteristics at some point grows with the number of images acquired. The frequency of image acquisition (data looks) and the image sampling design alter the probability of finding a tank-like image, but that way of looking at the problem is divorced from the real-time decision to be made based on current cumulative data. An earlier tank image is of no interest once a later, clearer image has been acquired. Analogously, with multiple data looks in a clinical trial where there is no treatment effect, the probability of having extreme data at some point grows with the number of looks. This affects type I error but not the true chance that the treatment works. Traditional statistics has multiplicity issues arising from giving more than one chance for data to be extreme (by taking more than one look at the data). It is the need for sampling multiplicity adjustments that makes traditional methods conservative from the standpoint of the decision maker, thus making continuous learning difficult and requiring larger sample sizes. The traditional approach limits the number of data looks and has a higher chance of waiting longer than needed to declare evidence sufficient. It also is likely to declare futility too late.

Bayesian statistics computes chances about unknown parameter values such as treatment effects, not chances about data. That is why it is possible to compute all of the probabilities listed above as often as desired. One can stop a study as soon as evidence is sufficient. The documents linked above (plus a more detailed explanation here) contain a simple simulation study showing that Bayesian posterior probabilities work exactly as advertised even when assessing efficacy each time a patient completes their observation. Bayes has an added advantage when using a skeptical prior distribution: If the study is terminated early because of adequate evidence for benefit, the estimate of the treatment effect is “pulled back” appropriately by the prior. Since the prior distribution’s influence fades as the sample size grows, the amount of discounting of the treatment effect lessens. Traditional group sequential testing in the frequentist domain results in biased point estimates, and the correction for this bias is so complex as to almost never be used in clinical trial publications.
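
To make the calibration claim concrete, below is a minimal sketch in R (an assumed normal data model and threshold, not the linked simulation): when the true log OR is drawn from the same skeptical prior used in the analysis, trials stopped the moment \(P(OR > 1 | \,\text{data}) > 0.95\) are correct about 95% of the time, no matter how often the data are examined.

# Minimal calibration sketch (assumed setup, not the linked simulation):
# draw a true log OR from a skeptical prior, look after every patient,
# stop when P(OR > 1 | data) > 0.95, and check the stopped trials
set.seed(1)
sim1 <- function(N=500, psd=0.354) {     # psd: prior SD of the log OR
  theta <- rnorm(1, 0, psd)              # true log OR drawn from the prior
  z     <- rnorm(N, theta, 2)            # assumed per-patient statistic, variance 4
  for(n in 20 : N) {
    v  <- 4 / n                          # variance of the running MLE
    pv <- 1 / (1 / v + 1 / psd ^ 2)      # posterior variance (normal-normal update)
    pm <- pv * mean(z[1 : n]) / v        # posterior mean (prior mean 0)
    if(pnorm(0, pm, sqrt(pv), lower.tail=FALSE) > 0.95)
      return(theta > 0)                  # stopped for efficacy; was OR really > 1?
  }
  NA                                     # never crossed the threshold
}
res <- replicate(2000, sim1())
mean(res, na.rm=TRUE)                    # ~0.95 or higher: efficacy claims are calibrated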

2 Using Bayesian Posterior Probabilities For Decision Making About Treatment Efficacy

The Bayesian paradigm can trigger an automatic optimal decision if the cost, loss, or utility functions for the possible decisions are known. These functions are very difficult to specify when there are competing safety issues, or when there are competing perspectives about the decisions to be made. So Bayesian inference often stops with provision of one component of the optimum Bayes decision: the posterior probability of efficacy. It is possible to pre-specify decision triggers such as “we will act as if the drug is effective if the probability of efficacy exceeds 0.95,” but there are always trade-offs against side-effect profiles and other considerations that are as yet unknown, so FDA and other entities make decisions on the basis of the totality of the evidence. For that reason it is best to provide only example triggers in the statistical plan. Because there are no multiplicities involved with looking at various direct-evidence posterior probabilities, those making the decision can be provided the totality of information at each look.

As described in the formal statistical analysis plan below, we will compute probabilities related to the comparison of hydroxychloroquine vs. placebo. The summary efficacy parameter is an odds ratio (OR) from an ordinal logistic model, with OR > 1 being favorable to hydroxychloroquine because the ordinal outcome scale goes from 1=dead to 7=not hospitalized, able to resume normal daily activities. Posterior probabilities will start to be computed after 20 patients have completed their follow-up, and will be computed every other day thereafter. The probabilities that will be reported will be conditioned on all data currently accumulated, denoted using the expression \(|\,\text{data}\) below.

Probability                                                  Evidence For
\(P_1 = P(OR > 1 | \,\text{data})\)                          any benefit
\(P_2 = P(OR > 1.05 | \,\text{data})\)                       more than trivial benefit
\(P_3 = P(OR > 1.25 | \,\text{data})\)                       moderate benefit or greater
\(P_4 = P(OR < 1 | \,\text{data})\)                          inefficacy or harm
\(P_5 = P(OR < \frac{1}{1.05} | \,\text{data})\)             more than trivial harm
\(P_6 = P(OR > \frac{1}{1.1} | \,\text{data})\)              non-inferiority with multiplicative margin 1.1
\(P_7 = P(\frac{4}{5} < OR < \frac{5}{4} | \,\text{data})\)  similarity
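
When the model is fitted by MCMC, each of these quantities is estimated by the proportion of posterior draws of the log OR falling in the stated interval; a sketch with a placeholder set of draws:

# Sketch: with posterior draws of the log OR in hand (vector 'lor',
# a placeholder here), each probability above is simply the proportion
# of draws falling in the corresponding interval
lor <- rnorm(4000, 0.2, 0.15)     # placeholder for draws from a fitted model
probs <- c(
  P1 = mean(lor >  0),                           # any benefit
  P2 = mean(lor >  log(1.05)),                   # more than trivial benefit
  P3 = mean(lor >  log(1.25)),                   # moderate benefit or greater
  P4 = mean(lor <  0),                           # inefficacy or harm
  P5 = mean(lor < -log(1.05)),                   # more than trivial harm
  P6 = mean(lor > -log(1.1)),                    # non-inferiority, margin 1.1
  P7 = mean(lor >  log(4/5) & lor < log(5/4)))   # similarity
round(probs, 3)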

A recently published COVID-19 clinical trial (discussed here) has been widely misinterpreted as demonstrating no benefit of a drug just because the p-value did not meet an arbitrary threshold, thus committing the “absence of evidence is not evidence of absence” error (in spite of the confidence interval being too wide to warrant a conclusion of no benefit).

The last probability listed above (\(P_7\)) is a direct quantification of evidence for the two treatment arms truly having similar efficacy, which avoids the “absence of evidence is not evidence of absence” error often seen when the traditional threshold p < 0.05 is not met.

For this study, example action triggers are:

  • Stop with evidence for efficacy if \(P_1 > 0.95\)
  • Stop with evidence for moderate or greater efficacy if \(P_3 > 0.8\)
  • Stop with evidence for inefficacy if \(P_4 > 0.8\)
  • Stop with evidence for harm if \(P_5 > 0.75\)
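
Given the probabilities computed as in the sketch above, checking the example triggers at a look is a few comparisons (illustrative only; actual decisions rest on the totality of evidence):

# Sketch: checking the example triggers at one look, given 'probs' as above
triggers <- c(efficacy   = unname(probs['P1'] > 0.95),
              moderate   = unname(probs['P3'] > 0.80),
              inefficacy = unname(probs['P4'] > 0.80),
              harm       = unname(probs['P5'] > 0.75))
triggers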

As discussed in the section on priors, there are reasons for using different priors for such different purposes.

3 Statistical Analysis Plan

3.1 Sample Size Considerations

Unlike traditional frequentist approaches that propose a fixed sample size based upon desired type I and type II error probabilities, a Bayesian approach is not required to fix the sample size in advance. However, it is possible to estimate the maximum sample size that is expected to be needed to answer the study question. We have based our estimate of 438 patients randomized 1:1 to treatment or control on the limited available information and on the design features of recently proposed studies, which have been designed for 0.9 power to detect an OR of 1.75 for improvement on the outcome using the PO model. If information accruing during the trial differs from what was expected, the sample size may change. This has the advantage of ensuring that the study neither stops too early when it is approaching a meaningful answer nor continues past the point where enough information is available to make a decision.

We recognize that our sample size estimate is based on limited information about expected treatment effects and about the properties of the outcome measurement, and so it may change as new information accrues.

Unblinded sample size re-estimation in the frequentist setting requires the kind of multiplicity adjustment discussed above. But our Bayesian sequential design allows for flexibility. For example, if a posterior \(P(OR > 1 | \,\text{data}) = 0.8\) at our estimated sample size, rather than stopping the trial, additional subjects could be enrolled with no penalty. This will lessen the chance of an equivocal final result. The way to know that the Bayesian approach is fair is to see how it might fail. One could add 20 patients and \(P(OR > 1 | \,\text{data}) = 0.7\). In this instance, the latter posterior probability supersedes the 0.8 and the decision to continue accruing may be different.

Bayesian power and sample size calculations usually require simulation, and the rapid development of COVID-19 trials does not allow all the Bayesian operating characteristics of each trial to be simulated. A good approximation is to use the frequentist approach, as in Section 7.8.3 of BBR, for the proportional odds model. As frequentist methods use only a single effect size for the power calculation, the chosen value of the effect one does not want to miss is all-important. Since the prior virtually wears off by the time the target sample size is reached, the resulting frequentist sample size can be labeled as the expected sample size of the Bayesian method. The sequential Bayesian procedure will of course allow for termination before that target sample size, as well as the possibility of study extension should results be promising but equivocal.

Frequentist sample sizes needed to detect a variety of ORs are computed below. For the proportional odds model, power is a function of the OR and of the combined-treatment-arm probabilities of the ordinal categories. Assumed outcome probabilities for the seven categories are shown below. Powers of 0.8, 0.85, and 0.9 are used in the sample size calculations. Sample sizes shown are the total number of patients required.

require(Hmisc)   # provides pomodm and posamsize

# Define outcome probabilities for control group
p <- c(2, 1, 2, 7, 8, 38, 42) / 100

# Assuming an odds ratio of 1.75, compute the average outcome
# probabilities over both treatment groups combined
pa <- pomodm(p=p, odds.ratio=sqrt(1.75))

n <- matrix(NA, nrow=3, ncol=4)
rownames(n) <- c('Power: 0.9', 'Power: 0.85', 'Power: 0.8')
pow <- c(0.9, 0.85, 0.8)
ors <- c(1.25, 1.5, 1.75, 2)
colnames(n) <- paste0('OR:', ors)
for(i in 1 : 3) for(j in 1 : 4)
    n[i, j] <- round(posamsize(pa, odds.ratio=ors[j], power=pow[i])$n)
n
            OR:1.25 OR:1.5 OR:1.75 OR:2
Power: 0.9     3019    914     480  313
Power: 0.85    2579    781     410  267
Power: 0.8     2255    683     359  234

3.2 Data Model

The univariate outcome is the 7-level COVID Ordinal Outcomes Scale assessed on Day 15 post randomization, defined in the following table.

Level   Description
1       Dead
2       Hospitalized, on ECMO or invasive mechanical ventilation
3       Hospitalized, on high flow \(O_2\) therapy or noninvasive mechanical ventilation
4       Hospitalized, on supplemental \(O_2\)
5       Hospitalized, not on supplemental \(O_2\)
6       Not hospitalized, unable to resume normal daily activities
7       Not hospitalized, able to resume normal daily activities

For this ordinal outcome scale, the proportional odds (PO) ordinal logistic semiparametric model (Walker & Duncan, 1967; Harrell, 2015) is an ideal approach, as it allows the distribution across the 7 levels to be arbitrary. The Wilcoxon test is a special case of the PO model when there are no adjustment covariates. We will use a Bayesian PO model, incorporating findings from Vanderbilt PhD student Nathan James, who has done a significant amount of research on how best to specify prior distributions for the intercepts (6 in this case) in the Bayesian PO model.
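
To illustrate the Wilcoxon connection on simulated data (a sketch only, not part of the analysis), the unadjusted PO model and the Wilcoxon test give closely agreeing evidence:

# Sketch: unadjusted PO model vs. Wilcoxon test on simulated data
require(rms)
set.seed(1)
y  <- sample(1 : 7, 60, replace=TRUE)
tx <- rep(0 : 1, each=30)
lrm(y ~ tx)            # PO model; its treatment test corresponds to Wilcoxon
wilcox.test(y ~ tx)    # p-values will closely agree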

The data model for Bayesian analysis is the same as the frequentist PO model except for the addition of prior distributions for the unknown parameters (see below). In addition to the dichotomous treatment assignment variable, the PO model will be adjusted for the following covariates:

  • SOFA score (linear)
  • age (quadratic; 2 parameters)
  • sex
  • levels 2-5 on the WHO ordinal outcome scale measured at baseline (on invasive mechanical ventilation or ECMO; on non-invasive ventilation or high flow nasal cannula; on supplemental oxygen; not on supplemental oxygen; quadratic; 2 parameters)
  • time from symptom onset in days (linear because of constrained range)

Covariate adjustment is used to enhance Bayesian power by adjusting for readily explainable outcome variation.
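
A sketch of the adjusted data model in rms notation, with simulated placeholder data and hypothetical variable names (`pol(x, 2)` supplies the linear plus quadratic terms); the Bayesian fit uses the same right-hand side with priors added:

# Frequentist counterpart of the data model; all names are hypothetical
require(rms)
set.seed(3)
n   <- 200
dat <- data.frame(y15        = sample(1 : 7, n, replace=TRUE),
                  treat      = sample(0 : 1, n, replace=TRUE),
                  sofa       = sample(0 : 15, n, replace=TRUE),
                  age        = round(runif(n, 30, 85)),
                  sex        = sample(c('female', 'male'), n, replace=TRUE),
                  who_base   = sample(2 : 5, n, replace=TRUE),
                  sympt_days = sample(1 : 10, n, replace=TRUE))
f <- lrm(y15 ~ treat + sofa + pol(age, 2) + sex + pol(who_base, 2) + sympt_days,
         data=dat)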

3.2.1 Extension to Longitudinal Ordinal Outcomes

The data model may easily be extended to a Bayesian hierarchical model to analyze daily or weekly ordinal outcome assessments. Assuming compound symmetric correlation structure one merely adds patient random effects into the model, using prevailing methods of specifying a prior for the variance of Gaussian random effects. With major help from Ben Goodrich, this has been made easy to do in the R rms package’s new blrm function, which uses an exponential distribution as a prior for the standard deviation of the random (patient) effects. In the Bayesian mixed effects model, time would be handled as a flexible fixed effect. For example, the overall time-response profile could be modeled as quadratic in days since randomization. To quantify evidence for changing treatment effect over time (and to estimate a possible delay time until treatment effect) one can compute posterior probabilities on time \(\times\) treatment interaction effects.
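
A sketch of this longitudinal extension using blrm’s `cluster()` notation (blrm is the Bayesian PO fitter described above, distributed with rms and later in the companion rmsb package), with simulated placeholder data and hypothetical variable names:

# Bayesian PO model with patient random effects and a quadratic time trend;
# the time x treatment interaction lets the treatment effect vary over time
require(rmsb)
set.seed(4)
n     <- 60
days  <- 1 : 14
dlong <- expand.grid(id=1 : n, day=days)
dlong$treat <- rep(sample(0 : 1, n, replace=TRUE), times=length(days))
dlong$y     <- sample(1 : 7, nrow(dlong), replace=TRUE)
# Random intercept per patient (compound symmetry); exponential prior on its SD
g <- blrm(y ~ treat * pol(day, 2) + cluster(id), data=dlong)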

Because of the number of choices, longitudinal models require one to prioritize the various contrasts that can be made. Here are some choices with the degrees of freedom for each, assuming time is modeled quadratically. Below \(Tx\) stands for treatment, \(C\) for control, \(t\) is time, and \(L\) is the last time of follow-up.

Estimand                                         d.f.   Question Addressed
\(Tx + Tx\times t\) interaction                  3      Does \(Tx\) have an effect at any time?
\(Tx - C\) at \(t=L\)                            1      Does \(Tx\) have an effect at time \(L\)?
\(Tx - C\) averaged over \(t=0 \rightarrow L\)   1      Does \(Tx\) have an effect on the average?

Note that omitting the \(Tx \times t\) interaction is similar to the last contrast.
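
The two 1-d.f. estimands could be computed from the fitted model with the rms `contrast` function; a sketch assuming the model `g` and the 14-day follow-up simulated above:

# Treatment effect (log OR) at the last follow-up day
contrast(g, list(treat=1, day=14), list(treat=0, day=14))
# Treatment effect averaged over the whole follow-up period
contrast(g, list(treat=1, day=days), list(treat=0, day=days), type='average')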

The Bayesian power calculations and stopping times presented here are for a univariate ordinal outcome. If repeated ordinal assessments are used, power will be greater and stopping time shorter due to a higher effective sample size.

3.3 Prior Distributions

As recommended by our current research, the intercepts will be given an essentially nonparametric prior using the Dirichlet distribution. The treatment effect, appearing as a log odds ratio in the PO model, will have a normal prior distribution such that

  • Prior \(P(OR > 1) = P(OR < 1) = 0.5\) (equally likely harmful as beneficial)
  • Prior \(P(OR > 2) = P(OR < \frac{1}{2}) = 0.025\) (large effects in either direction unlikely)
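
These two constraints determine the prior standard deviation in closed form; a one-line check:

# SD of the normal prior for log(OR) satisfying P(OR > 2) = 0.025
sigma <- log(2) / qnorm(0.975)
round(sigma, 3)   # 0.354, matching the corresponding row of the table below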

The variance of the prior normal distribution for the log OR is computed to satisfy tail-area constraints such as those above. In general, let R denote an unknown effect ratio (an odds ratio or hazard ratio, for example). For equal chances of benefit and harm, the mean of the prior distribution on the log R scale is zero and P(R < 1) = P(R > 1) = 0.5. The following table shows the constraint on R corresponding to each of several prior distributions, along with the tail probabilities and the value σ that enforces those prior tail probabilities. Also shown are the prior probability of a large effect (effect ratio < 0.75, i.e., a 25% reduction) and the prior probability of similarity, with similarity of treatment effect taken to be R in the interval [0.85, 1/0.85].

# For each combination of tail probability p and ratio cutoff k, solve for
# the prior SD sigma of log R such that P(R < k) = P(R > 1/k) = p, then
# compute the prior probability of a large effect and of similarity
d <- expand.grid(p=c(0.025, 0.05, 0.1, 0.4), k=c(0.05, 0.1, 0.25, 0.5))
d <- transform(d,
  w     = paste0('P(R < ', k, ') = P(R > ', 1/k, ') = ', p),
  sigma = log(k) / qnorm(p))    # solves P(log R < log k) = p for N(0, sigma^2)
d <- transform(d,
  big   = round(pnorm(log(0.75) / sigma), 3),                             # P(R < 0.75)
  sim   = round(pnorm(-log(0.85) / sigma) - pnorm(log(0.85) / sigma), 3), # P(similar)
  sigma = round(sigma, 3))
d <- d[c('w', 'sigma', 'big', 'sim')]
names(d) <- c('Effect Constraint', 'σ', 'P(R < 0.75)', 'P(0.85 < R < 1/0.85)')
knitr::kable(d, format='pipe')
Effect Constraint                  σ        P(R < 0.75)   P(0.85 < R < 1/0.85)
P(R < 0.05) = P(R > 20) = 0.025    1.528    0.425         0.085
P(R < 0.05) = P(R > 20) = 0.05     1.821    0.437         0.071
P(R < 0.05) = P(R > 20) = 0.1      2.338    0.451         0.055
P(R < 0.05) = P(R > 20) = 0.4      11.825   0.490         0.011
P(R < 0.1) = P(R > 10) = 0.025     1.175    0.403         0.110
P(R < 0.1) = P(R > 10) = 0.05      1.400    0.419         0.092
P(R < 0.1) = P(R > 10) = 0.1       1.797    0.436         0.072
P(R < 0.1) = P(R > 10) = 0.4       9.089    0.487         0.014
P(R < 0.25) = P(R > 4) = 0.025     0.707    0.342         0.182
P(R < 0.25) = P(R > 4) = 0.05      0.843    0.366         0.153
P(R < 0.25) = P(R > 4) = 0.1       1.082    0.395         0.119
P(R < 0.25) = P(R > 4) = 0.4       5.472    0.479         0.024
P(R < 0.5) = P(R > 2) = 0.025      0.354    0.208         0.354
P(R < 0.5) = P(R > 2) = 0.05       0.421    0.247         0.300
P(R < 0.5) = P(R > 2) = 0.1        0.541    0.297         0.236
P(R < 0.5) = P(R > 2) = 0.4        2.736    0.458         0.047

When the tail probability is large, say 0.4 in the above table, σ is large enough that the prior on the log R scale is virtually flat, and the prior probability of a large reduction (and of a large increase) in bad outcomes is large. At that σ the prior probability of similarity is at a minimum.

For effects of adjustment covariates, the prior distributions will be normal with mean 0 and much larger variances to reflect greater uncertainty in their potential effects.

For assessing evidence for effectiveness it is recommended that a skeptical prior be used. For assessing inefficacy (including harm) and similarity a flat prior may be recommended. The operating characteristics presented below examine various combinations of priors and types of assessments, although only some of these combinations are pertinent.

3.4 Operating Characteristics

Bayesian operating characteristics include computing quantities such as

  • the probability that the posterior probability of any efficacy will ever exceed 0.95 over any of the sequential data looks, given the prior distribution specified above (Bayesian)
  • the probability of ever exceeding 0.95 as a function of the true OR (including cases where the OR is near 1.0)

Simulations of such quantities are fast if a normal approximation is used for the data likelihood and a normal distribution is used as the prior for the log odds ratio. The posterior distribution is then also normal, with mean equal to a discounted maximum likelihood estimate of the log odds ratio (with almost no discounting as the sample size grows) and with precision (reciprocal variance) equal to the sum of the precisions of the log odds ratio estimate and the prior.
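
This normal-normal update is what `Hmisc::gbayes` (used in the simulation code below) computes; a small sketch with an assumed interim estimate:

# Skeptical prior: log OR ~ N(0, sigma^2) with P(OR > 2) = 0.025
# Assumed interim data: MLE of log OR = log(1.5) with variance 0.1
require(Hmisc)
g <- gbayes(mean.prior=0, cut.prior=log(2), cut.prob.prior=0.025,
            stat=log(1.5), var.stat=0.1)
c(mean.post=g$mean.post, var.post=g$var.post)
pnorm(0, g$mean.post, sqrt(g$var.post), lower.tail=FALSE)   # P(OR > 1 | data)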

The following simulations are set up as follows:

  • maximum sample size of N=500
  • data looks after patient 20, 21, 22, …, 500
  • OR set to 0.8, 0.801, 0.802, …, 2.5
  • 5 types of evidence triggers
  • prior \(P(OR > 2)\) set to 0.01, 0.025, 0.05, 0.1, 0.25 (with 0.499 added as a check against calculations with a flat prior)
  • one simulated RCT per OR, estimate trends as smooth functions of ORs
  • no covariate adjustment (so Bayesian power will be slightly underestimated)
require(rms)   # provides lrm; also loads Hmisc (pomodm)

N       <- 500    # maximum number of patients
ors     <- seq(.8, 2.5, by=0.001)   # true ORs to consider

# Function to simulate n patients randomized to control or treatment
# (n/2 on the average in each group) under a given odds ratio
simdat <- function(p, n, or) {
  treat <- sample(0 : 1, n, replace=TRUE)
  n0 <- sum(treat == 0)
  n1 <- sum(treat == 1)
  y  <- rep(0, n)
  y[treat == 0] <- sample(1 : 7, n0, prob=p, replace=TRUE)
  ## Modify p by the odds ratio to get treated-group probabilities
  p1 <- pomodm(p=p, odds.ratio=or)
  y[treat == 1] <- sample(1 : 7, n1, prob=p1, replace=TRUE)
  data.frame(treat, y)
}

# Check calculation by using OR=2 and fitting a large simulated sample
# d <- simdat(pa, 100000, 2)
# exp(coef(lrm(y ~ treat, data=d))['treat'])

# Function to run the simulations for one OR
# For each OR simulate one trial
# For the trial simulate all N patient outcomes but
# reveal the outcomes sequentially starting when 20 pts accrued
# For each of N-19 sequential looks at the data, compute the
# log odds ratio and its standard error

simor <- function(or, p, N) {
  beta <- se <- rep(NA, N)
  d    <- simdat(p, N, or)

  for(look in 20 : N) {
    f          <- lrm(y ~ treat, data=d[1 : look, ])
    beta[look] <- coef(f)['treat']
    se[look]   <- sqrt(vcov(f)['treat', 'treat'])
  }
  cbind(or, look = 1 : N, beta, se)
}

# To run simor for a series of ORs
simors <- function(ors, p, N, progress=TRUE) {
  nor <- length(ors)
  R   <- matrix(NA, nrow=nor * N, ncol=4,
                dimnames=list(NULL, c('or', 'look', 'beta', 'se')))
  j <- 0
  for(i in 1 : nor) {
    if(progress && (i %% 10 == 0))
      cat('OR:', ors[i], '\n', file='progress.txt', append=TRUE)
    s <- simor(ors[i], p, N)
    j <- j + 1
    k <- j + N - 1
    R[j : k, ] <- s
    j <- k
  }
  R
}

cat('', file='progress.txt')   # initialize progress log
# simdone is a caching flag set earlier in the report so that the lengthy
# simulation is run only once
if(! simdone) {
  simResults <- simors(ors, p, N)
  Save(simResults)
} else Load(simResults)
S <- as.data.frame(simResults)

# Define evidence targets with OR threshold and post prob cutoffs
trig <- c(efficacy            = 'P(OR > 1) > 0.95\nEfficacy',
          'moderate efficacy' = 'P(OR > 1.25) > 0.8\nModerate Efficacy',
          inefficacy          = 'P(OR < 1) > 0.8\nInefficacy',
          harm                = 'P(OR < 1/1.05) > 0.75\nSignificant Harm',
          similarity          = 'P(4/5 < OR < 5/4) > 0.7\nSimilarity')

types <- names(trig)
numtypes        <- 1 : 5   # used when creating all-numeric matrices
names(numtypes) <- types

# First index at which x is TRUE; v (default: one past the last look) if never
findfirst <- function(x, v=N + 1) {
  j <- which(x)
  if(length(j)) min(j) else v
}
# Normal tail probabilities; the defaults m and s are the current posterior
# mean and SD defined in the enclosing loops (resolved by R's lazy evaluation)
ngt <- function(x, mean=m, sd=s) pnorm(x, mean, sd, lower.tail=FALSE)
nlt <- function(x, mean=m, sd=s) pnorm(x, mean, sd, lower.tail=TRUE)

# 0.499 -> SD=277, an effectively uninformative flat prior
ptails <- c(0.01, 0.025, 0.05, 0.1, 0.25, 0.499)

First quantify the effect of the degree of skepticism in the prior by computing, as a function of the sample size (look number), the difference between the posterior probability of efficacy under a flat prior and under each skeptical prior.

beta  <- S$beta
se    <- S$se
sor   <- S$or
slook <- S$look
L <- matrix(NA, nrow=sum(ors > 1) * length(ptails), ncol=4,
            dimnames=list(NULL,
             c('or', 'look', 'ptail', 'nsac')))
P <- matrix(NA, nrow=N * sum(ors > 1), ncol = 3 + length(ptails))
cptails <- as.character(ptails)
colnames(P) <- c('look', 'beta', 'pflat', cptails)

is <- j <- 0
be <- seb <- numeric(N)
for(or in ors[ors > 1]) {
  # For all N looks at the current OR compute posterior P(OR > 1)
  # under a flat prior
  i         <- sor == or
  l         <- slook[i]
  be[l]     <- beta[i]
  seb[l]    <- se[i]
  peff.flat <- ngt(0, mean=be, sd=seb)
  P[is + l, 1:3] <- cbind(l, be, peff.flat)
  for(ptail in ptails) {
    g <- gbayes(mean.prior=0, cut.prior=log(2), cut.prob.prior=ptail,
                stat=be, var.stat=seb^2)
    m    <- g$mean.post
    s    <- sqrt(g$var.post)
    peff <- ngt(0)   # posterior P(OR > 1) for current skeptical prior, all looks
    P[is + l, as.character(ptail)] <- peff
    nsac <- NA

    # For skeptical priors, find the first look with post prob > 0.95,
    # then find the lead time by finding at what earlier look the post
    # prob under a flat prior was > 0.95
    # If the efficacy trigger never happened, treat it as happening at
    # look 501 if the flat prior trigger ever happened
    # Posterior under flat prior has mean = beta, SD = se

    istop <- findfirst(peff >= 0.95)
    if((istop < 501) | (istop == 501 &
                        (max(peff.flat, na.rm=TRUE) >= 0.95))) {
      istopflat <- findfirst(peff.flat[1 : min(500, istop)] >= 0.95, 0)
      if(istopflat == 0)
        stop(paste0('Program logic error: or=', or, '  ptail=', ptail,
                    '  istop=', istop))
      nsac <- istop - istopflat
    }
    j <- j + 1
    L[j, ] <- c(or, istop, ptail, nsac)
  }
  is <- is + N
}
P <- P[P[, 1] >= 20, ]   # keep only looks from patient 20 onward
P <- P[, -9]             # drop the flat-prior (0.499) column
W <- P
W[, 4:8] <- W[, 3] - W[, 4:8]   # flat-prior posterior minus skeptical posteriors
W <- W[, -(2:3)]
look <- W[, 1]
W <- W[, -1]
w <- summarize(W, look, function(y) apply(y, 2, mean, na.rm=TRUE))
names(w)[2:6] <- colnames(W)
require(data.table)   # provides melt for reshaping
x <- melt(data.table(w), id.vars='look', variable.name='ptail')
plegend <- guides(color=guide_legend(title='Prior P(OR > 2)'))
ggplot(x, aes(x=look, y=value, color=ptail)) + geom_smooth() +
    plegend + xlab('Look') + ylab('Mean Reduction in Posterior')

Figure 3.1: Mean reduction in posterior probabilities from the posterior probability arising from a flat prior, as a function of the sample size (look number) and degree of skepticism in the prior.

Next compute the median number of looks earlier that a flat prior would have hit the 0.95 efficacy trigger when the skeptical prior first triggered. This is done only for true ORs > 1. If the flat prior ever triggered but the skeptical prior did not, consider the skeptical prior to have triggered at look 501 when computing the lag.

w <- as.data.frame(L)
# Treat non-stopped trials as look delay of zero
w <- transform(w, nsac = ifelse(is.na(nsac), 0, nsac))
if(any(subset(w, ptail > 0.4)$nsac > 0)) stop('probable program error')
w <- subset(w, ptail < 0.4)
# Estimate median lag by quantile regression
w$ptailf <- as.factor(w$ptail)
dd <- datadist(w); options(datadist='dd')
f  <- Rq(nsac ~ rcs(log(or), 5) * ptailf, data=w)
ggplot(Predict(f, or, ptailf), xlab='True OR', ylab='Median Lag') + plegend

Figure 3.2: Median number of looks earlier that a flat prior would have stopped for \(P(OR > 1 | \,\text{data}) \geq 0.95\)

Next, the performance of all the triggers is evaluated by simulation.

focus <- data.frame(or    = c(0.8, 1.0, 1.75),
                    type  = c('inefficacy', 'similarity', 'efficacy'),
                    ptail = c(0.499, 0.499, 0.025))
## Don't match on OR because we have 1 trial per OR and get estimates
## at specific ORs by smooth interpolation
combf <- function(or, type, ptail)
  paste0(type, '  P(OR > 2)=', ptail)
focus$combo <- with(focus, combf(or, type, ptail))

# For each individual simulation (one OR) and each evidence trigger
# compute the first look where the trigger happened
if(simlooksdone) {
  Load(simlooks)
  R <- simlooks
  Load(simlooks2)
  H <- simlooks2
} else {

  R <- matrix(NA, nrow=length(ors) * length(ptails) * length(types), ncol=4)
  colnames(R) <- c('or', 'type', 'ptail', 'first')
  j <- 0
  H <- matrix(NA, nrow=length(ors) * N * nrow(focus), ncol=5)
  colnames(H) <- c('or', 'type', 'ptail', 'look', 'chit')
  hs <- 1
  he <- N

  # Cumulative sum treating NAs (looks before patient 20) as zero
  csum <- function(x) {
    x[is.na(x)] <- 0
    cumsum(x)
  }

  for(uor in unique(S$or)) {
    w <- subset(S, or == uor)
    if(nrow(w) != N) stop('program logic error 1')
    for(ptail in ptails) {
      g <- gbayes(mean.prior=0, cut.prior=log(2), cut.prob.prior=ptail,
                  stat=w$beta, var.stat=(w$se)^2)
      m <- g$mean.post
      s <- sqrt(g$var.post)

      look <- w$look
      if(! all(look == 1 : N)) stop('program logic error 2')
      for(type in types) {
        hit <-
          switch(type,
                 efficacy            = ngt(0)          > 0.95,
                 'moderate efficacy' = ngt(log(1.25))  > 0.8,
                 inefficacy          = nlt(0)          > 0.8,
                 harm                = nlt(-log(1.05)) > 0.75,
                 similarity          = (nlt(log(1.25)) - nlt(-log(1.25))) > 0.7)

        if(combf(uor, type, ptail) %in% focus$combo) {
          H[hs : he, ] <- cbind(uor, numtypes[type], ptail, look, csum(hit))
          hs <- he + 1
          he <- he + N
        }
        first <- findfirst(hit)
        j <- j + 1
        R[j, ] <- c(uor, numtypes[type], ptail, first)
      }
    }
  }

  simlooks <- as.data.frame(R)
  simlooks$type <- names(numtypes)[simlooks$type]
  Save(simlooks)

  simlooks2 <- as.data.frame(H)
  simlooks2$type <- names(numtypes)[simlooks2$type]
  Save(simlooks2)

  # Use the labeled data frames from here on, as in the cached branch
  R <- simlooks
  H <- simlooks2
}

The first simulation summary will display the probability of a trigger happening any time during the looks after patient 20, 21, …, 500 as a function of the true OR and type of evidence being assessed. The simulations are summarized by fitting flexible binary logistic models predicting the trigger probability.

# Estimate P(stop by patient 500) as a smooth function of the true OR
# by fitting a flexible binary logistic model to the trigger indicators
fit <- function(d) {
    y   <- d$first < 501
    ors <- seq(.8, 2.5, length=200)
    if(all(y == 0))      prob <- rep(0, 200)
    else if(all(y == 1)) prob <- rep(1, 200)
    else {
        f <- lrm(y ~ rcs(log(or), 5), data=d, maxit=30)
        prob <- plogis(predict(f, data.frame(or=ors)))
    }
    cbind(or=ors, prob=prob)
}
R <- as.data.frame(R)
r <- matrix(NA, nrow=length(types) * length(ptails) * 200, ncol=4)
colnames(r) <- c('type', 'ptail', 'or', 'prob')
j <- 0
for(ty in types) {
    for(pta in ptails) {
        d  <- subset(R, ptail==pta & type==ty)
        sm <- fit(d)
        j  <- j + 1
        k  <- j + 199
        r[j:k, ] <- cbind(numtypes[ty], pta, sm)
        j <- k
    }
}
r <- as.data.frame(r)
r$type <- names(numtypes)[r$type]
ggplot(r, aes(x=or, y=prob, color=as.factor(ptail))) + geom_line() +
    facet_wrap(~ trig[type]) + xlab('True OR') + ylab('P(Stop by 500)') +
    theme(legend.position=c(.85, .17)) +
  plegend

Figure 3.3: Simulated probability of stopping by 500 patients/looks as a function of the unknown OR. Curves corresponding to a prior probability tail area of 0.499 represent a flat prior distribution.

Were the unknown OR to be exactly 1.0, the above efficacy panel shows a non-trivial chance of a conclusion of efficacy (\(OR > 1\)) by look number 500, with the chance varying over the four prior distributions. Only the prior such that \(P(OR > 2) = 0.01\) is very safe, but using this prior would make the study take longer to detect real efficacy. But there is a better way to think about this issue: what is wrong with, say, a posterior \(P(OR > 1)\) of 0.96 at the moment of stopping for efficacy? The answer is that nothing at all is wrong with it, if the posterior probability was computed using a prior distribution that is deemed acceptable by the person who judges “what is wrong.” An observer who worries a great deal about the chance of ever declaring efficacy were the unknowable OR actually equal to 1.0 is in effect using a prior with a discontinuity at 1.0 (say a 50/50 mixture of a normal distribution for the log odds ratio and a lump of probability at log odds = 0). Had the posterior been computed under such a prior, the posterior probability at the moment of stopping would, by mathematical necessity, have to be acceptable to that skeptical observer. As Tukey argued decades ago, it is impossible for a treatment to have exactly zero effect. And if you assume the impossible, strange things will happen.

The above calculations will now be re-phrased to vary the number of looks continuously and to compute stopping probabilities for only three ORs: 0.8, 1.0, and 1.75. The graphs below show the probability of stopping by the number of looks (patients) given on the \(x\)-axis. Only three combinations of OR, target, and prior will be studied.

targetprob <- c(inefficacy=0.8, similarity=0.7, efficacy=0.95)
H$comb <- with(H, combf(or, type, ptail))
dd <- datadist(H); options(datadist='dd')
# Check: look at look 500
# h <- subset(H, look==500)
# f <- lrm(chit > 0 ~ rcs(or, 5) * comb, data=h, tol=1e-13, maxit=24)
#ggplot(Predict(f, or, comb, fun=plogis))   # P(efficacy) approx .95

ps <- list()
for(co in unique(H$comb)) {
  f <- lrm(chit > 0 ~ rcs(look, 5) * rcs(or, 5),
           data=H, subset=comb == co, tol=1e-13, maxit=25)
  o <- c("efficacy  P(OR > 2)=0.025"   = 1.75,
         "inefficacy  P(OR > 2)=0.499" = 0.8,
         "similarity  P(OR > 2)=0.499" = 1)
  ps[[co]] <- Predict(f, look, or=o[co], fun=plogis, conf.int=FALSE)
}
p <- rbind('Efficacy when OR=1.75\nskeptical prior'  = ps[[1]],
           'Inefficacy/harm when OR=0.8\nflat prior' = ps[[2]],
           'Similarity when OR=1.0\nflat prior'      = ps[[3]])
ggplot(p, xlab='Look', ylab='Cumulative Probability of Stopping') +
  guides(color=guide_legend(title=''))

Figure 3.4: Cumulative probability of stopping by a given cumulative sample size (\(x\)-axis), for three targets. The probability of stopping is estimated at a different true unknown OR for each target, and only the skeptical prior with P(OR > 2)= 0.025 is used for assessing efficacy.

Next consider expected time until the various triggers would happen. Since for many of the simulations a given evidence trigger did not occur at any time through the inclusion of the 500th patient, the time-to-trigger is right censored. The mean and median time-to-trigger for each OR and type of trigger were estimated using a flexible lognormal survival model with right censoring. When the mean/median estimate is well beyond the maximum sample size of 500, it may be considered infinite for all practical purposes. Estimated means and medians are truncated at 5000 looks. The results are plotted below.

fitln <- function(d) {
  ors     <- seq(.8, 2.5, length=200)
  y       <- d$first
  stopped <- y < 501
  if(sum(stopped) < 2) medp <- meanp <- rep(10000, length(ors))
  else {
    y  <- pmin(y, 500)
    or <- d$or
    f  <- psm(Surv(y, stopped) ~ rcs(log(or), 5), dist='lognormal')
    # For the lognormal model exp(linear predictor) is the median stopping
    # time; Mean(f) creates a function computing the mean stopping time
    medp  <- Predict(f, or=ors, fun=exp, conf.int=FALSE)$yhat
    M     <- Mean(f)
    meanp <- Predict(f, or=ors, fun=M, conf.int=FALSE)$yhat
  }
  rbind(data.frame(stat='median', or=ors, est=medp),
        data.frame(stat='mean',   or=ors, est=meanp))
}

r <- NULL
for(ty in types) {
  for(pta in ptails) {
    d <- subset(R, ptail == pta & type == ty)
    w <- fitln(d)
    r <- rbind(r,
               data.frame(type=ty, ptail=pta, or=w$or, stat=w$stat, est=w$est))
  }
}
g <- function(st) {
  yl <- paste(upFirst(st), 'Stopping Time')
  ggplot(subset(r, stat == st),
         aes(x=or, y=pmin(est, 5000), color=as.factor(ptail))) +
    geom_line() +
    facet_wrap(~ trig[type]) +
    scale_y_continuous(trans='log10',
                       breaks=c(10, 50, 75, 100, 200, 300, 400, 500,
                                1000, 2000, 5000, 10000)) +
    xlab('True Odds Ratio') + ylab(yl) +
    theme(legend.position = c(.85, .3)) +
    plegend
}
g('mean')

Figure 3.5: Mean number of looks before stopping for various triggers and priors

g('median')

Figure 3.6: Median number of looks before stopping for various triggers

As shown here, posterior probabilities remain perfectly calibrated even when there are unlimited data looks and stopping occurs immediately the first time the posterior probability exceeds some cutoff. The only way for the posterior probabilities to be seen as miscalibrated by an observer would be for that observer to insert their own prior that conflicts with the prior specified above.

Note that type I assertion probability is not relevant to Bayesian optimum decision making. Though many investigators wish to show that type I probability is controlled, this has nothing to do with the probability of making a mistake in concluding that a drug is effective. That probability would be \(P(OR \leq 1 | \,\text{reject}\, H_0)\) and is not part of traditional inference. Type I error probability is the chance of asserting efficacy when the true effect is exactly zero, i.e., \(P(\text{reject}\, H_0 | \,OR=1)\). A probability calculation that assumes \(H_0\) is true cannot shed light on the veracity of what has already been assumed (\(H_0\)). In addition, type I probability involves the sample space (Bayesians are interested only in chances about the OR, not chances about sample data) and hence cannot be computed in a non-fixed-sample-size situation without knowing the intentions of the decision maker.

3.5 Statistical Computing and Reporting

The Stan-based rstanarm R package will be used to fit the above model and construct posterior distributions for all parameters. Nathan James’ R package using rstanarm is here. Ben Goodrich of Columbia University has Stan code and notes here. More information on Bayesian proportional odds modeling code is being updated here.

The entire posterior density function for the treatment effect (log OR) will be provided, along with the cumulative posterior distribution, which quantifies evidence for all possible treatment effects. Particular probabilities of OR intervals as listed above will also be computed, along with two-sided 0.95 Bayesian credible intervals and the posterior mean, median, and mode of the unknown effect. In addition to evidence about the primary efficacy OR parameter, posterior distributions for all outcome probabilities and differences in outcome probabilities (when baseline covariates are set to particular values \(x\) such as their means) will be given. The latter represent absolute risk differences for \(P(Y \geq y | \,X=x,\,\text{data})\), where \(y = 2, \ldots, 7\). For example, when \(y=6\) these are probabilities of being alive at home. The following table of posterior mean risks from the proportional odds model will be constructed:

Outcome                                Probability Drug   Probability Control
Alive
Alive without ECMO
Alive without high flow \(O_2\)
Alive without supplemental \(O_2\)
At home
At home, resume normal activities
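
A sketch of how the posterior mean risks in this table could be computed from posterior draws of the PO model parameters (placeholder draws shown; `alpha` holds the 6 intercepts, `beta` the treatment log OR, with covariates at reference values):

# Placeholder posterior draws: 'alpha' = draws x 6 matrix of intercepts,
# 'beta' = vector of treatment log OR draws
set.seed(5)
alpha <- matrix(rnorm(4000 * 6, qlogis(c(.98, .97, .95, .88, .80, .42)), 0.1),
                nrow=4000, ncol=6, byrow=TRUE)
beta  <- rnorm(4000, 0.3, 0.15)
# P(Y >= y | treat) = plogis(alpha_y + beta * treat), averaged over draws;
# the 6 results correspond to the rows Alive, ..., At home resume normal
risk <- function(treat) colMeans(plogis(alpha + beta * treat))
cbind('Probability Drug' = risk(1), 'Probability Control' = risk(0))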

Markov chain Monte Carlo simulation convergence will be checked using diagnostics provided by rstanarm. We note that the R brms package makes it easy to use Bayesian PO models, but the package is not flexible enough on the priors for intercepts, at least when there is a large number of them.

Evidence for differential treatment effect (also called heterogeneity of treatment effect) will be assessed by including interactions between treatment and each of the pre-specified covariates, one at a time (a sketch follows). A detailed analytic plan for this approach may be found in Section 13.6 of this document, which details problems with the traditional subgroup approach. Our more precise and interpretable model-based approach will involve computing posterior probabilities of clinically meaningful differential effects for each covariate, and providing posterior mean treatment effects as a function of each covariate, along with 0.95 credible interval bands. [Note that subgrouping methods do not work for continuous covariates and can be misleading if the subgrouping variable is collinear with another covariate.]
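
A sketch of one such assessment for the age interaction, reusing the simulated `dat` from Section 3.2 and a double-difference contrast (hypothetical ages 50 and 70):

# Interaction between treatment and age added to the Bayesian PO model
h <- blrm(y15 ~ treat * pol(age, 2) + sofa + sex, data=dat)
# Posterior for the difference in treatment log ORs at age 70 vs. age 50
contrast(h, list(treat=1, age=70), list(treat=0, age=70),
            list(treat=1, age=50), list(treat=0, age=50))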

3.6 Interim Analyses

As described above, posterior probabilities will start to be computed after 20 patients have completed their follow-up and will be computed every other day thereafter.

Harrell, F. E. (2015). Regression Modeling Strategies, with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (Second edition). Springer. https://doi.org/10.1007/978-3-319-19425-7

Walker, S. H., & Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables. Biometrika, 54, 167–178.