Asking one to compute type I assertion probability α for a Bayesian design is like asking a poker player winning > $10M/year to justify his ranking by how often he places bets in games he didn’t win.

Do you want the probability of a positive result when there is nothing? Or the probability that a positive result turns out to be nothing?

There are many goals of comparing treatments, including quantifying evidence for

any efficacy

non-trivial efficacy

inefficacy

harm

similarity

non-inferiority

and sometimes the goal is just to estimate the treatment effect, whatever it is. Bayesian methods provide probabilities that each of the above conclusions is correct, given the data, the data model, and the prior distribution for the treatment effect. For estimation, Bayes provides a complete probability distribution for the unknown effect, and corresponding interval estimates (highest posterior density intervals, credible intervals). With the enormous number of traditional frequentist clinical trials that reach their planned fixed sample size but have equivocal results, one particular advantage of Bayesian sequential designs is all-important: minimizing the chance of getting to the maximum sample size with no conclusion about efficacy.

A major advantage of Bayes is that it provides evidence in favor of assertions, not just evidence against a hypothesis. That explains why Bayesian designs and analyses are the same (other than the assertion whose probability is being computed) for all study goals. Only the required sample size may differ by the trial’s intent.

Another major advantage of Bayes is that since it involves probabilities about unknowns given the current data, it does not have multiplicity problems with sequential evidence assessment (Chapter 5). Evidence from current data are not down-weighted because the data were examined previously or because they will be examined in the future. Bayesian trial design need not have a sample size, and if it does, there is nothing wrong with exceeding that sample size when results are equivocal. This implies that Bayesian designs are less likely to result in a trial having an equivocal result. And the more often one analyzes the data the lower the expected sample size. One can stop as soon as there is a pre-specified amount of evidence for one of the conclusions listed above. When sample size N is discussed below, it can refer to a planned maximum sample size, an average sample size, the 75\(^\text{th}\) percentile of a distribution of sample sizes, etc.

Evidence from current data is the point of emphasis. Correctness of a decision is the probability that a decision made using the current data is correct. In frequentism, multiplicity comes from multiple chances to observe extreme data, which grows as the number of data looks increases. Bayesian posterior probabilities are about efficacy, not about data.

With a Bayesian sequential design it is easy to add a requirement that the treatment effect be estimated with a specified precision. Often this precision (e.g., half-width of a credible interval) can be estimated before the trial starts. Then the precision-required sample size can define the time of the first data look.

9.2 Challenges

Change conceptions about operating characteristics

Primary OC must be the correctness of whatever decision is made

Be willing to stop a study very early when there is a very low probability of non-trivial efficacy

Whether to stop a study early for a conclusion of similarity of efficacy of the two treatments

Abandon need for estimating treatment effect when there is ample evidence for inefficacy

Be objective and formal about interpretation of effects by specifying a threshold for non-trivial effect

Removing cost from consideration of effect sizes to detect in favor of finding an outcome variable that will have rich enough information to achieve power with clinically-chosen effect size

Regulators and sponsors must recognize that the operating characteristic that needs to be optimized is the correctness of whatever decision is made about treatment effectiveness.

Sponsors must be willing to stop studies very early when there is strong evidence of inefficacy. This will free up patients with rare diseases for other trials, and free up resources likewise.

To be able to stop studies very early so as to not waste resources when the probability is very high that there is no non-trivial efficacy, sponsors must jettison the need to be able to estimate how ineffective a treatment is.

At present, consideration of whether a nonzero treatment effect is trivial is highly subjective and informal. To make optimal and defensible decisions, this must be formalized by specifying a threshold for non-triviality of effect. Importantly, this threshold is smaller than the MCID.

The current practice of choosing a more than clinically important effect size to use in power calculations has been called “sample size samba” and has resulted in a scandalous level of overly optimistically designed trials, commonly having an equivocal result. Many equivocal trials observed an effect that was merely clinically significant. Powering the study to only detect a “super clinically significant” effect results in low actual power and wasted resources. Objectively choosing effect sizes without cost considerations, then abandoning studies that cannot find an acceptable outcome variable that has sufficient resolution to achieve adequate power with a “real” effect size will prevent a large amount of waste and make clinical trials more ethical while freeing up patients with rare diseases to participate in other trials.

9.3 Bayesian Operating Characteristics

The primary Bayesian operating characteristic is how often a conclusion is correct. The primary frequentist operating characteristic is how often one reaches a conclusion.

Bayesian inference allows one to make both positive (“the drug works”) and negative (“the drug doesn’t work”) conclusions. Evaluation of Bayesian operating characteristics can focus on either the probability that whatever decision is made is correct, or the correctness of a single decision such as a drug works. Frequentism based on null hypothesis testing and P-values only brings evidence against a hypothesis, so the only possible actions are (1) to reject \(H_0\) (the one actual conclusion a frequentist approach can make), (2) get more data, or (3) stop with no conclusion except a failure to reject \(H_0\). The operating characteristic given greatest emphasis by regulators has been type I assertion probability \(\alpha\), the probability of rejecting \(H_0\) when it is magically known to be true.

When designs are being formulated and when N is being selected, researchers need to know that the typical way the trial is likely to unfold is acceptable. The performance of a proposed Bayesian design must be judged by its Bayesian operating characteristics (OCs). Bayes is about prospective analysis and interpretation, i.e., trying to uncover the unknown based on what is known. It minimizes the use of unobservables in acting in forward information-flow predictive mode. Frequentist OCs deal with what may happen under unknowable assumptions about the treatment effect generating the data. Bayesian OCs are the only OCs important when running Bayesian analyses, interpreting the results, and knowing how much to trust them. A key point is that accuracy of conclusions pertains to what has been learned from data, not what data may arise under certain unknowable assumptions about effects. Accuracy is a post-data consideration.

Bayesian OCs are of four types. First there is assurance, which is the ability of the design to detect a real treatment effect were it to exist. This is Bayesian power (the chance of achieving a high posterior probability of efficacy). Second is precision of post-data knowledge about the treatment effect, e.g., half of the width of an uncertainty interval. Third is the expecting stopping time, which is related to the efficiency of the design and the expected cost of the study. Fourth is the all-important judgment of the correctness of the decision at the moment the decision is made. Bayesian approaches optimize the chance of conclusions being correct if the following three conditions hold:

The first two conditions are in common with traditional frequentist approaches.

the variables being analyzed are measured unbiasedly

the data model is consistent with the experimental design and is reasonably well-fitting

the prior used in the analysis is not in conflict with the prior used by the judge of the accuracy of the study results

Quantifying correctness of conclusions for a Bayesian design is equivalent to assessing predictive accuracy, which involves examining what’s real as a function of predicted values. Correctness of conclusions (predictions) is easy to check in a simulation study, as explained in detail in Section 9.9. The process involves looking back to the treatment effect that generated the current trial data after the conclusion about the treatment effect has been made. After a large number of trials are simulated, the subset of trials reaching a certain conclusion is chosen, and for that subset the proportion of trials for which the true treatment effect is consistent with the conclusion is computed.

When Bayes is used to the fullest, decisions take consequences into account through a utility, loss, or cost function. The optimum Bayes decision that maximizes expected utility is a function of the posterior distribution of the treatment effect and the utility function (decision consequences). But different stakeholders have different utility functions, so we are usually content to deal only with the posterior distribution, a major input into the optimum decision.

In actual studies most of the attention about this most important OC is focused on the choice of the prior distribution. When the prior is acceptable, the Bayesian procedure will result in optimum decisions in the eye of the beholder of the prior, by definition. Sensitivity of decisions to the choice of a prior would be a useful assessment at the design stage.

In Chapter 5 Bayesian sequential analysis is studied in more detail, though the simulations there are simple from the standpoint of having only one reason for stopping. In that situation, one stops, for example, when a posterior probability PP exceeds 0.95. When the simulation is done using the same prior as the prior used in the analysis, if the PP is 0.96 upon stopping, the true probability of a treatment effect will be exactly 0.96. Trusting the Bayesian results at the decision point requires the PP being well calibrated. This happens when there is no major conflict between the design prior and the analysis prior.

Maximize the probability of making the right decision, with 4 decisions possible (efficacy, similarity, inefficacy, not possible to draw any conclusion with available resources)

Minimize the probability of an inconclusive result at study end

Minimize the expected sample size

Stop as early as possible if there is a very high probability that the treatment’s effect is trivial or worse, even if the magnitude of the effect cannot be estimated with any precision

If stopping early for non-trivial efficacy, make sure that the sample size provides at least crude precision of the effect estimate

Do this by not making the first efficacy look until the lowest sample size for which at least crude precision of the treatment effect is attained

Avoid gaming the effect size used in a power calculation

Recognize that traditional sample sizes are arbitrary and require too many assumptions

Recognize that what appears to be a study extension to one person will appear to be an interim analysis to another who doesn’t need to do a sample size calculation

Build clinical significance into the quantification of evidence for efficacy

Definitions

\(\Delta\): true treatment effect being estimated

\(\delta\): minimum clinically important \(\Delta\)

\(\gamma\): minimum detectable \(\Delta\) (threshold for non-trivial treatment effect, e.g. \(\frac{1}{3}\delta\))

SI: similarity interval for \(\Delta\), e.g., \([-\frac{1}{2}\delta, \frac{1}{2}\delta]\)

\(N\): average sample size, used for initial resource planning. This is an estimate of the ultimate sample size based on assuming \(\Delta = \delta\). \(N\) is computed to achieve a Bayesian power of 0.9 for efficacy, i.e., achieving a 0.9 probability at a fixed sample size that the probability of any efficacy exceeds 0.95 while the probability of non-trivial efficacy exceeds 0.85, i.e., \(\Pr(\Delta > 0) > 0.95\) and \(\Pr(\Delta > \gamma) > 0.85\).

More than one \(N\) can be computed, e.g., \(N\) such that if \(\Delta = \gamma\) there is ever a high probability of stopping for inefficacy

\(N_p\): minimum sample size for first efficacy assessment, based on required minimum precision of estimate of \(\Delta\)

Probability cutoffs are examples

This design does not require a separate futility analysis, as inefficacy subsumes futility

Below, “higher resolution Y” can come from breaking ties in the outcome variable (making it more continuous), adding longitudinal assessments, or extending follow-up

Example operating characteristics for this design are simulated in Section 9.9.

Bayesian inference opens up a wide variety of adaptive trials

stat methods need no adaptations

only sample size and Bayesian power calculations are complicated

But here consider non-adaptive trials with possible sequential monitoring

9.5 Sample Size Estimation

In a sense more honest than frequentist b/c uncertainty admitted

SDs

effect size

Whitehead et al. (2015): sample size calculation for control vs. set of active tx

goal: high PP that ≥ 1 tx better than control or

high PP that none is better

Other good references: Kunzmann et al. (2021), Spiegelhalter et al. (1993), Grouin et al. (2007), Spiegelhalter & Freedman (1986), Adcock (1997), Joseph & Bélisle (1997), Pezeshk & Gittins (2002), Simon & Freedman (1997), Wang et al. (2005), Braun (n.d.), this

Spiegelhalter et al. (1993) has clear rationale for unconditional power integrating over prior

“realistic assessment of the predictive probability of obtaining a ‘significant’ result”

discusses damage caused by using fixed optimistic treatment effect

Uebersax has an excellent article on Bayesian unconditional power analysis which also discusses the hazards of choosing a single effect size for power estimation

9.6 Bayesian Power and Sample Size Examples

Example notion of power: probability that PP will exceed a certain value such as 0.95 (Bayesian assurance)

Compute as a function of the unknown effect, see which true effects are detectable for a given sample size

If the minimum clinically important difference (MCID) is known, compute the success probability at this MCID

Better: take uncertainty into account by sampling the MCID from an uncertainty (prior) distribution for it, using a different value of MCID for each simulated trial. The Bayesian power (probability of success) from combining all the simulations averages over the MCID uncertainties. This is demonstrated in a time-to-event power calculation below.

9.6.1 One-Sample Mean Example

Consider earlier one-sample mean problem

Data are normal with variance 1.0 and the prior is normal with mean zero and variance \(\frac{1}{\tau}\)

\(\overline{Y}\) is normally distributed with mean \(\mu\) and variance \(\frac{1}{n}\)

Long-run average of \(\overline{Y}\) is \(\mu \rightarrow\) long-run median of \(\overline{Y}\) is also \(\mu \rightarrow\) median PP \(= \Phi(\frac{n \mu}{\sqrt{n + \tau}})\)

Let \(z = \Phi^{-1}(0.95)\)

Chance that PP exceeds 0.95 is \(P(\Phi(\frac{n \overline{Y}}{\sqrt{n + \tau}}) > 0.95) =
P(\frac{n \overline{Y}}{\sqrt{n + \tau}} > z) =
\Phi(\frac{n \mu - z \sqrt{n + \tau}}{\sqrt{n}})\)

For varying \(n\) and \(\tau\) the median PP and the Bayesian power are shown in the figure

In this example we compute the sample size needed to achieve adequate Bayesian power to achieve evidence for various assertions. The treatment effect is summarized by a hazard ratio (HR) in a proportional hazards (PH) model. Assertions for which the study needs to have the ability to provide conclusive evidence include

any reduction of hazard of an event (HR < 1)

HR < 0.98, 0.96, 0.94, …

harm or trivial benefit (HR > 0.95)

The trial’s actual analysis might use a flexible parametric PH model implemented in the rstanarm Bayesian survival analysis package. Bayesian operating characteristics are simulated here using the Cox PH model.

Take Bayesian power to be the probability that the posterior probability will exceed 0.975 when the true HR is a constant^{1}. For the first two types of assertions, set the HR generating the data to 0.75 (assumed minimally clinically important difference), and for the third type of assertion set HR to 1.25.

^{1} Ideally one would consider power over a distribution of HRs as done below.

The actual analysis to be used for the clinical trial should adjust for covariates, but covariates are ignored in our simulations. Bayesian power will increase with the addition of covariates.

If the treatment groups are designated A (standard of care) and B (new treatment), suppose that the following Weibull survival distribution has been found to fit patient’s experiences under control treatment.

\[S(t) = e^{- 0.9357 t^{0.5224}}\] The survival curve for treatment B patients is obtained by multiplying 0.9357 by the B:A hazard ratio (raising \(S(t)\) to the HR power).

Assume further that follow-up lasts 3y, 0.85 of patients will be followed a full 3y, and 0.15 of patients will be followed 1.5y when the final analysis is done^{2}.

^{2} The R Hmisc package spower function will simulate much more exotic situations.

To compute Bayesian power, use the normal approximation to the Cox partial likelihood function, i.e., assume that log(HR) is normal and use standard maximum likelihood methods to estimate the mean and standard deviation of the log(HR). The Cox log-likelihood is known to be very quadratic in the log HR, so the normal approximation will be very accurate. For a prior distribution assume a very small probability of 0.025 that HR < 0.25 and a probability of 0.025 that it exceeds 4. The standard deviation for log(HR) that accomplishes this is \(\frac{\log(4)}{1.96} = 0.7073 = \sigma\).

With a normal approximation for the data likelihood and a normal prior distribution for log(HR), the posterior distribution is also normal. Let \(\theta\) be the true log(HR) and \(\hat{\theta}\) be the maximum likelihood estimate of \(\theta\) in a given simulated clinical trial. Let \(s\) be the standard error of \(\hat{\theta}\)^{3}.

^{3} Note that in a proportional hazards model such as the one we’re using, the variance of \(\hat{\theta}\) is very close to the sum of the reciprocals of the event frequencies in the two groups.

Let \(\tau = \frac{1}{\sigma^{2}} = 1.9989\) and let \(\omega = \frac{1}{s ^ {2}}\). Then the approximate posterior distribution of \(\theta\) given the data and the prior is then normal with mean \(\mu\) and variance \(v\) where \(v = \frac{1}{\tau + \omega}\) and \(\mu = \frac{\omega \hat{\theta}}{\tau + \omega}\).

The posterior probability that the true HR < \(h\) is the probability that \(\theta < \log(h)\), which is \(\Phi(\frac{\log(h) - \mu}{\sqrt{v}})\).

Simulation

First consider code for generating outcomes for one clinical trial of size \(n\) per group, follow-up time cutoff \(u\), secondary cutoff time \(u2\) with probability \(pu2\), and true hazard ratio \(H\). The result is posterior mean and variance and the P-value from a frequentist likelihood ratio \(\chi^2\) test with the Cox PH model.

Code

require(data.table)require(ggplot2)require(rms)require(survival)sim1 <-function(n, u, H, pu2=0, u2=u, rmedian=NA) {# n: number of pts in each group# u: censoring time# H: hazard ratio# pu2: probability of censoring at u2 < u# rmedian: ratio of two medians and triggers use of log-normal model# Generate failure and censoring times for control patients# Parameters of a log-normal distribution that resembles the Weibull# distribution are below (these apply if rmedian is given) mu <--0.2438178 sigma <-1.5119236 ta <-if(is.na(rmedian)) (-log(runif(n)) /0.9357) ^ (1.0/0.5224)elseexp(rnorm(n, mu, sigma)) ct <-pmin(u, ifelse(runif(n) < pu2, u2, u)) ea <-ifelse(ta <= ct, 1, 0) ta <-pmin(ta, ct) # apply censoring# Generate failure and censoring times for intervention patients tb <-if(is.na(rmedian)) (-log(runif(n)) / (H *0.9357)) ^ (1.0/0.5224)elseexp(rnorm(n, mu +log(rmedian), sigma)) ct <-pmin(u, ifelse(runif(n) < pu2, u2, u)) eb <-ifelse(tb <= ct, 1, 0) tb <-pmin(tb, ct) ti <-c(ta, tb) ev <-c(ea, eb) tx <-c(rep(0, n), rep(1, n)) f <-if(is.na(rmedian)) coxph(Surv(ti, ev) ~ tx) elsesurvreg(Surv(ti, ev) ~ tx, dist='lognormal') chi <-2*diff(f$loglik) theta.hat <-coef(f)[length(coef(f))] s2 <-vcov(f)['tx', 'tx'] tau <-1.9989 omega <-1.0/ s2 v <-1.0/ (tau + omega) mu <- omega * theta.hat / (tau + omega)if(!all(c(length(mu), length(v), length(chi))) ==1) stop("E")list(mu =as.vector(mu),v =as.vector(v),pval =as.vector(1.0-pchisq(chi, 1)))}

Simulate all statistics for \(n=50, 51, ..., 1000\) and HR=0.75 and 1.25, with 10 replications per \(n\)/HR combination.

The number of trials simulated for each of the two HR’s was 9510.

Before simulating Bayesian power, simulate frequentist power for the Cox PH likelihood ratio \(\chi^2\) test with two-sided \(\alpha=0.05\). Effect size to detect is HR=0.75.

Code

R <- S[HR ==0.75, .(reject =1* (pval <0.05)), by=n]# Compute moving proportion of trials with p < 0.05# Use overlapping windows on n and also logistic regression with# a restricted cubic spline in n with 5 knotsm <-movStats(reject ~ n, eps=100, lrm=TRUE,data=R, melt=TRUE)ggplot(m,aes(x=n, y=reject, linetype=Type)) +geom_line() +geom_hline(yintercept=0.9, alpha=0.3) +scale_x_continuous(breaks=seq(0, 1000, by=100)) +scale_y_continuous(breaks=seq(0, 1, by=0.1),minor_breaks=seq(0, 1, by=0.05)) +ylab('Pr(reject H0:HR=1)')

Use linear interpolation to solve for the sample size needed to have 0.9 frequentist power.

Code

lookup <-function(n, p, val=0.9)round(approx(p, n, xout=val)$y)m[Type =='LR', lookup(n, reject)]

[1] 333

For each trial simulated with HR=0.75, compute posterior probabilities that the true HR \(< h\) for various HR cutoffs \(h\). Both moving proportions and logistic regression with a restricted cubic spline in \(n\) are used to smooth the results to estimate the relationship between \(n\) and Bayesian power, i.e., the probability that the posterior probability reaches 0.975 or higher.

Code

R <- S[HR ==0.75]hs <-seq(0.8, 1.0, by =0.02)pp <-function(mu, v) list(h=hs, p=pnorm((log(hs) - mu) /sqrt(v)))R <- R[, pp(mu, v), by=.(n, rep)]# Compute moving proportion with posterior prob > 0.975# with overlapping windows on n, and separately by hR[, phi :=1* (p >=0.975)]m <-movStats(phi ~ n + h, eps=100, lrm=TRUE, lrm_args=list(maxit=30),data=R, melt=TRUE)ggplot(m,aes(x=n, y=phi, color=h, linetype=Type)) +geom_line() +geom_hline(yintercept=0.9, alpha=0.3) +scale_x_continuous(breaks=seq(0, 1000, by=100)) +scale_y_continuous(breaks=seq(0, 1, by=0.1),minor_breaks=seq(0, 1, by=0.05)) +ylab('Pr(Posterior Pr(HR < h) > 0.975)')

Use linear interpolation to compute the \(n\) for which Bayesian power is 0.9.

Code

u <- m[Type =='LR', .(nmin=lookup(n, phi)), by=h]u

h nmin
<char> <num>
1: 0.8 NA
2: 0.82 NA
3: 0.84 NA
4: 0.86 NA
5: 0.88 NA
6: 0.9 866
7: 0.92 687
8: 0.94 584
9: 0.96 489
10: 0.98 394
11: 1 339

Code

Nprime <- u[h ==1, nmin]

Check the Bayesian power to detect a trivial or worse treatment benefit when true HR is 1.25.

Code

HR <-1.25hs <-0.95R <- S[HR ==1.25, pp(mu, v), by=.(n, rep)]# One minus because we want Pr(HR > 0.95)R[, phi :=1* (1.0- p >0.975)]m <-movStats(phi ~ n, eps=100, lrm=TRUE, data=R, melt=TRUE,lrm_args=list(maxit=30))ggplot(m, aes(x=n, y=phi, linetype=Type)) +geom_line() +ylab('Pr(Posterior Pr(HR > 0.975))') +geom_hline(yintercept=0.9, alpha=0.3) +scale_x_continuous(breaks=seq(0, 1000, by=100)) +scale_y_continuous(breaks=seq(0, 1, by=0.1),minor_breaks=seq(0, 1, by=0.05))

Find \(n\) where it crosses 0.9.

Code

m[Type =='LR', lookup(n, phi)]

[1] 328

Accounting for Uncertainty in MCID

Different reviewers and journal readers may have different ideas of a minimal clinically important difference to detect. Instead of setting the MCID to HR=0.75, suppose that all HR in the interval [0.65, 0.85] are equally likely to be the MCID at the time the study findings are made known. Let’s re-simulate 1000 trials to compute Bayesian power at a sample size of 339 per group, varying the MCID over [0.65, 0.85].

The Bayesian power drops from 0.9 to 0.848 when the stated amount of uncertainty is taken into account for MCID.

9.6.3 Bayesian Power for 3-Level Ordinal Outcome

Suppose that the above time-to-event outcome is primary, and that there is a secondary outcome of infection within 90 days of randomization. To explicitly account for deaths within the 90-day window we use a 3-level proportional odds ordinal logistic model with 0=alive and infection-free, 1=alive and infection, 2=dead (whether or not preceded by infection). For the actual study we may use a partial proportional odds model to allow for a different treatment effect for the union of infection and death as compared to the effect for death alone. A prior distribution can be specified for the ratio of the two odds ratios (for infection + death and for death) such that there is a 0.95 chance that the ratio of odds ratios is between \(\frac{1}{3}\) and \(3\). A Bayesian power simulation with that amount of borrowing between the treatment effect for the two outcomes is presented at the end of this section.

Use the Weibull model described above to estimate 90-day mortality.

Code

rnd <-function(x) round(x, 3)S <-function(t) exp(-0.9357* t ^0.5224)m90 <-1.-S(90/365.25)m30 <-1.-S(30/365.25)

The estimated mortality by 90d is 0.362. The estimated probability of infection for 90-day survivors is 0.2, so the ordinal outcome is expected to have relative frequencies listed in the first line below.

Code

# Hmisc posamsize and popower function cell probabilities are averaged# over the two groups, so need to go up by the square root of the odds # ratio to estimate these averagesphigh <-c(1.-0.2- m90, 0.2, m90)or <-0.65pmid <-pomodm(p=phigh, odds.ratio=sqrt(or))plow <-pomodm(p=pmid, odds.ratio=sqrt(or))phigh <-rnd(phigh)pmid <-rnd(pmid)plow <-rnd(plow)n <-round(posamsize(pmid, odds.ratio=or, power=0.8)$n /2)Infpow <-round(popower(pmid, odds.ratio=or, n=2* Nprime)$power, 3)Infpow

[1] 0.844

Treatment

Y=0

Y=1

Y=2

Standard Tx

0.438

0.2

0.362

New Tx

0.545

0.185

0.27

Average

0.491

0.195

0.314

The second line in the table are the outcome probabilities when an odds ratio of 0.65 is applied to the standard treatment probabilities. The last line is what is given to the posampsize and popower functions to compute the sample size need for an \(\alpha=0.05\) two-sided Wilcoxon/proportional odds test for differences in the two groups on the 3-level outcome severity with power of 0.8. The needed sample size is 301 per group were a frequentist test to be used. The frequentist power for the sample size (339 per group) that will be used for the primary outcome is 0.844.

For Bayesian power we use a process similar to what was used for the proportional hazards model above, but not quite trusting the quadratic approximation to the log likelihood for the proportional odds model, due to discretenesss of the outcome variable. Before simulating a power curve for all sample sizes using the normal approximation to the data likelihood, we do a full Bayesian simulation of 1000 trials with 300 patients per group in addition to using the approximate normal prior/normal likelihood method.

Here is a function to simulate one study with an ordinal outcome.

Code

require(rmsb)options(mc.cores=4, rmsb.backend='cmdstan')cmdstanr::set_cmdstan_path(cmdstan.loc)sim1 <-function(n, pcontrol, or, orcut=1, bayes=FALSE) {# n: number of pts in each group# pcontrol: vector of outcome probabilities for control group# or: odds ratio# orcut: compute Pr(OR < orcut)# Compute outcome probabilities for new treatment ptr <-pomodm(p=pcontrol, odds.ratio=or) k <-length(pcontrol) -1 tx <-c(rep(0, n), rep(1, n)) y <-c(sample(0: k, n, replace=TRUE, prob=pcontrol),sample(0: k, n, replace=TRUE, prob=ptr) ) f <-lrm(y ~ tx) chi <- f$stats['Model L.R.'] theta.hat <-coef(f)['tx'] s2 <-vcov(f)['tx', 'tx'] tau <-1.9989 omega <-1.0/ s2 v <-1.0/ (tau + omega) mu <- omega * theta.hat / (tau + omega)if(!all(c(length(mu), length(v), length(chi))) ==1) stop("E") postprob <-NULLif(bayes) {# Set prior on the treatment contrast pcon <-list(sd=0.7073, c1=list(tx=1), c2=list(tx=0),contrast=expression(c1 - c2) ) f <-blrm(y ~ tx, pcontrast=pcon,loo=FALSE, method='sampling', iter=1000) dr <- f$draws[, 'tx'] postprob <-mean(dr <log(orcut)) # Pr(beta<log(orcut)) = Pr(OR<orcut) }c(mu =as.vector(mu),v =as.vector(v),pval =as.vector(1.0-pchisq(chi, 1)),postprob = postprob)}# Define a function that can be run on split simulations on# different coresg <-function() { nsim <-1000# Function to do simultions on one core run1 <-function(reps, showprogress, core) { R <-matrix(NA, nrow=reps, ncol=4,dimnames=list(NULL, c('mu', 'v', 'pval', 'postprob')))for(i in1: reps) {showprogress(i, reps, core) R[i, ] <-sim1(300, phigh, 0.65, bayes=TRUE) }list(R=R) }# Run cores in parallel and combine results# 3 cores x 4 MCMC chains = 12 cores = machine maximumrunParallel(run1, reps=nsim, seed=1, along=1, cores=3)}R <-runifChanged(g, file='k3.rds') # 15 min.

Compute frequentist power and exact (to within simulation error) and approximate Bayesian powper.

h <-function(x) c(Power =round(mean(x), 3))u[, rbind('Frequentist'=h(pval <0.05),'Bayesian'=h(postprob >0.975),'Approximate Bayesian'=h(approxbayes >0.975))]

Power
Frequentist 0.795
Bayesian 0.782
Approximate Bayesian 0.780

The posterior distribution is extremely well approximated by a normal distribution at a sample size that matters. Bayesian power is just slightly below frequentist power because frequentist test statistics do not place limits on the treatment effect, e.g., allow the odds ratio to be unrealistically far from 1.0.

Use the approximation for proportional odds model Bayesian power going forward. Compute power for the primary endpoint’s sample size.

The Bayesian power to detect an OR of 0.65 at a sample size of 339 per group is 0.811.

Bayesian Power for Different Treatment Effect for Infection and Death

The proportional odds (PO) model can provide a correct assessment of which treatment is better overall even when PO is not satisfied. But with non-PO the power will not be optimal, and estimates of outcome probabilities for individual outcomes can be off. To estimate Bayesian power when PO is not assumed, allow the treatment effect (OR) for I|D (infection unioned (“or”d) with death) to differ from the OR for D. In the \(Y=0,1,2\) notation we have one OR for \(Y\geq 1\) and a separate OR for \(Y=2\). This is the partial PO model of Peterson & Harrell (1990). The amount of borrowing for the I|D treatment effect and the D effect is specified by a prior for the ratio of the I|D and D ORs. We assume the log of the ratio of the ORs is normal with mean zero and variance such that the probability that the ratio of ratios is between \(\frac{1}{3}\) and \(3\) is 0.95. The standard deviation accomplishing this is \(\frac{\log(3)}{1.96} = 0.5605\).

Also simulate the power of the treatment comparison on I|D when there is no borrowing of treatment effect information across I and D, i.e., when the prior for the non-PO effect (here the special effect of treatment on D) is non-informative. This partial PO model assumes nothing about how the treatment effect on I|D compares to the effect on D.

The sample size is 302 per group to yield Bayesian power of 0.8 against OR=0.65 in PO holds. For this sample size, simulations showed the loss of power by not assuming PO was minimal when the data satisfied PO. So the simulations are run with \(n=75\) per group to better understand power implications for varying amounts of information borrowing.

The R rmsb package’s blrm function has the user specify priors for departures from PO through contrasts and through a “departure from PO” function (the cppo argument). The contrast for the differential treatment effect over outcome components in the information-borrowing case is the same one that we apply to the main treatment effect except for using a standard deviation involving \(\log(3)\) instead of \(\log(4)\).

When the statistical analysis plan for the study is detailed, more simulations will be run that use a non-PO data generating mechanism, but for now data will be simulated the same as earlier, assuming PO for treatment.

Code

sim1npo <-function(n, pcontrol, or, orcut=1, peffect=FALSE) {# n: number of pts in each group# pcontrol: vector of outcome probabilities for control group# or: odds ratio to detect for I|D# orcut: Compute Pr(OR < orcut)# Set peffect=TRUE to debug# Compute outcome probabilities for new treatment ptr <-pomodm(p=pcontrol, odds.ratio=or) k <-length(pcontrol) -1 tx <-c(rep(0, n), rep(1, n)) y <-c(sample(0: k, n, replace=TRUE, prob=pcontrol),sample(0: k, n, replace=TRUE, prob=ptr) )# Set prior on the treatment contrast pcon <-list(sd=0.7073, c1=list(tx=1), c2=list(tx=0),contrast=expression(c1 - c2) )# PO model first f <-blrm(y ~ tx, pcontrast=pcon,loo=FALSE, method='sampling', iter=1000) g <-blrm(y ~ tx, y ~ tx, cppo=function(y) y ==2,pcontrast=pcon,loo=FALSE, method='sampling', iter=1000)# Set prior on the "treatment by Y interaction" contrast npcon <- pcon npcon$sd <-0.5605 h <-blrm(y ~ tx, y ~ tx, cppo=function(y) y ==2,pcontrast=pcon, npcontrast=npcon,loo=FALSE, method='sampling', iter=1000)if(peffect) {print(f)print(g)print(h)print(rbind(po=c(coef(f), NA), flat=coef(g), skeptical=coef(h)))return() } df <- f$draws[, 'tx'] # log OR for I|D dg <- g$draws[, 'tx'] # = log OR for D for PO dh <- h$draws[, 'tx']# Pr(beta < 0) = Pr(OR < 1) lor <-log(orcut)c(po =mean(df < lor), flat =mean(dg < lor), skeptical =mean(dh < lor))}g <-function() { nsim <-1000# Function to do simultions on one core run1 <-function(reps, showprogress, core) { R <-matrix(NA, nrow=reps, ncol=3,dimnames=list(NULL, c('po', 'flat', 'skeptical')))for(i in1: reps) {showprogress(i, reps, core) R[i,] <-sim1npo(N, phigh, 0.65) }list(R=R) }# Run cores in parallel and combine results# 3 cores x 4 MCMC chains = 12 cores = machine maximumrunParallel(run1, reps=nsim, seed=11, cores=3, along=1)}N <-75R <-runifChanged(g, sim1, N, file='k3npo.rds')bp <-apply(R, 2, function(x) mean(x >=0.975))bp

po flat skeptical
0.228 0.203 0.208

For \(n=75\) per group, the Bayesian power is 0.228 when completely borrowing information from D to help inform the treatment effect estimate for D|I. Allowing for arbitrarily different effects (non-PO) reduces the power to 0.203, and assuming some similarity of effects yields a power of 0.208.

9.6.4 Bayesian Power for 6-Level Ordinal Non-Inferiority Outcome

Suppose that the 6-level ECOG performance status assessed at 30 days is another secondary endpoint, and is considered to be a non-inferiority endpoint. Assessment of non-inferiority is more direct with a Bayesian approach and doesn’t need to be as conservative as a frequentist analysis, as we are assessing evidence for a broader range of treatment effects that includes slight harm in addition to the wide range of benefits.

The outcome will be analyzed using a Bayesian proportional odds model with the same prior on the log odds ratio as has been used above. The 6 cell probabilities assumed for the control group appear in the top line in the table below.

The estimated mortality by 30d is 0.224.

Code

# Start with relative probabilities of Y=0-4 and make them# sum to one minus 30d mortalityphigh <-c(42, 41, 10, 1, 1) /100phigh <-c(phigh /sum(phigh) * (1.- m30), m30)or <-0.65pmid <-pomodm(p=phigh, odds.ratio=sqrt(or))plow <-pomodm(p=pmid, odds.ratio=sqrt(or))n <-round(posamsize(pmid, odds.ratio=or, power=0.8)$n /2)Perfpowf <-popower(pmid, odds.ratio=or, n = Nprime *2)$powerg <-function(x) {names(x) <-paste0('Y=', (1:length(x)) -1)round(x, 3)}rbind('Standard treatment'=g(phigh),'New treatment'=g(plow),'Average'=g(pmid) )

Y=0 Y=1 Y=2 Y=3 Y=4 Y=5
Standard treatment 0.343 0.335 0.082 0.008 0.008 0.224
New treatment 0.446 0.319 0.065 0.006 0.006 0.158
Average 0.393 0.330 0.074 0.007 0.007 0.189

The second line in the table are the outcome probabilities when an odds ratio of 0.65 is applied to the standard treatment probabilities. The last line is what is given to the posampsize function to compute the sample size need for an \(\alpha=0.05\) two-sided Wilcoxon / proportional odds test for differences in the two groups on the 6-level outcome severity with power of 0.8. This is also used for popower. The needed sample size is 283 per group were a frequentist test to be used. This is for a superiority design. The frequentist power for 339 per group is 0.864524.

For performance status a non-inferiority design is used, so Bayesian power will differ more from frequentist superiority power, as the Bayesian posterior probability of interest is \(\Pr(\text{OR} < 1.2)\). Simulate Bayesian power using the normal approximation.

Bayesian power to detect an OR of 0.65 when seeking evidence for OR < 1.2 with 339 patients per group is 0.987.

9.7 Sample Size Needed For Prior Insensitivity

One way to choose a target sample size is to solve for \(n\) that controls the maximum absolute difference in posterior probabilities of interest, over the spectrum of possible-to-observe data. Let

\(n\) = sample size in each of two treatment groups

\(\Delta\) = unknown treatment effect expressed as a difference in means \(\mu_{1} - \mu_{2}\)

the data have a normal distribution with variance 1.0

\(\hat{\Delta}\) = observed difference in sample means \(\overline{Y}_{1} - \overline{Y}_{2}\)

\(V(\hat{\Delta}) = \frac{2}{n}\)

Frequentist test statistic: \(z = \frac{\hat{\Delta}}{\sqrt{\frac{2}{n}}}\)

\(\delta\) assumed to be the difference not to miss, for frequentist power

One-sided frequentist power for two-sided \(\alpha\) with \(z\) critical value \(z_{\alpha}\): \(\Pr(z > z_{\alpha}) = 1 - \Phi(z_{\alpha} - \frac{\delta}{\sqrt{\frac{2}{n}}})\)

Assume skeptical prior for \(\Delta\) that is normal with mean 0 and variance \(\frac{1}{\tau}\)

Posterior distribution for \(\Delta | \hat{\Delta}\) is normal

variance = \(\frac{1}{\tau + \frac{n}{2}}\)

mean = \(\frac{n\hat{\Delta}}{n + 2\tau}\)

Code

pp <-function(tau, n, delta, Deltahat)round(1-pnorm((delta - n * Deltahat / (n +2* tau)) *sqrt(tau + n /2)), 4)# Compute prior SD such that P(Delta > delta) = probpsd <-function(delta, prob) round(delta /qnorm(1- prob), 4)

Example: Probability of misleading evidence for efficacy

agreed-upon prior was such that \(\Pr(\Delta > 0.5) = 0.05 \rightarrow\) SD of the prior is 0.304

someone disagrees, and has their own posterior probability of efficacy based on their prior, which was chosen so that \(\Pr(\Delta > 0.5) = 0.005 \rightarrow\) SD of their prior is 0.1941

suppose that \(\hat{\Delta}\) at \(n=100\) is 0.3

\(\rightarrow\) posterior probability of \(\Delta > 0\) using pre-specified prior is 0.9801

\(\rightarrow\) posterior probability of \(\Delta > 0\) using the critic’s prior is 0.9783

there is substantial evidence for efficacy in the right direction

probability of being wrong in acting as if the treatment works is 0.0199 to those who designed the study, and is 0.0217 to the critic

the two will disagree more about evidence for a large effect, e.g. \(\Pr(\Delta > 0.2)\)

\(\Pr(\Delta > 0.2)\) using pre-specified prior: 0.724

\(\Pr(\Delta > 0.2)\) for the critic: 0.7035

Probabilities of being wrong in acting as if the treatment effect is large: 0.276, 0.2965

Sample size needed to overcome effect of prior depends on the narrowness of the prior (\(\tau; \sigma\))

for posterior mean and variance to be within a ratio \(r\) of those from a flat prior \(n\) must be \(> \frac{2\tau r}{1 - r}\)

within 5% for \(n > 38\tau\)

within 2% for \(n > 98\tau\)

within 1% for \(n > 198\tau\)

typical value of \(\sigma = \frac{1}{\tau} = 0.5\) so \(n > 196\) to globally be only 2% sensitive to the prior

Choose different values of the prior SD to examine sensitivity of cumulative posterior distribution to prior

Below see that the posterior probability of any treatment benefit (when \(\delta=0\)) is not sensitive to the prior

Probability of benefit \(> \delta\) for large \(\delta\) is sensitive to the prior

priors having large probability of large \(\Delta\) lead to higher posterior probabilities of large \(\Delta > \delta\) when \(\delta\) is large

For each \(n, \delta\), and SDs compute the absolute difference in posterior probabilities that \(\Delta > \delta\)

First look at the result when assessing evidence for any efficacy (\(\delta=0\)) when the most extreme priors (\(\sigma=\frac{1}{8}\) vs. \(\sigma=8\)) are compared

Don’t consider conditions when both posterior probabilities are in [0.1, 0.9] where disagreements are probably inconsequential

Code

require(data.table)sigmas <-c(1/8, 8)w <-expand.grid(n =c(1:30, seq(35, 200, by=5)),Deltahat =seq(-1, 5, by=0.025))setDT(w)z <-lapply(1:nrow(w),function(i) w[i, .(n, Deltahat,p1 =pp(8, n, 0, Deltahat),p2 =pp(1/8, n, 0, Deltahat) )] )u <-rbindlist(z)# Ignore conditions where both posterior probabilities are in [0.1, 0.9]# Find maximum abs difference over all Deltahatm <- u[, .(mdif =max(abs(p1 - p2) * (p1 <0.1| p2 <0.1| p1 >0.9| p2 >0.9))), by=n]ggplot(m, aes(x=n, y=mdif)) +geom_line() +scale_x_continuous(minor_breaks=seq(0, 200, by=10)) +scale_y_continuous(minor_breaks=seq(0, 0.2, by=0.01)) +ylab(expression(paste('Maximum |Difference| in ', Pr(Delta > delta ~"|"~hat(Delta)))))

By \(n=170\) the largest absolute difference in posterior probabilities of any efficacy is 0.01

The difference is 0.02 by \(n=80\)

Sample sizes needed for frequentist power of 0.9 are \(n=38\) for \(\Delta=0.5\) and \(n=85\) for \(\Delta=0.25\)

Now consider the full picture where evidence for efficacy \(> \delta\) where \(\delta \neq 0\) is included

Vary \(n, \delta, \hat{\Delta}\) and the two prior SDs

Don’t show points where the difference < 0.0025 or when both posterior probabilities are in [0.1, 0.9]

In the graph one of the prior SDs is going across, and the other is vertical; diagonals where the two SDs are equal are not needed

Code

# For a given n and delta two values of SDs compute the absolute difference# of the two posterior probabilities that Delta > delta when one of the posterior# probabilities was < 0.1 or > 0.9 and when the difference exceeds 0.0025adif <-function(n, delta, s1, s2) { tau1 <-1/ s1 tau2 <-1/ s2 Deltahat <-seq(-1, 5, by=0.05) p1 <-pp(tau1, n, delta, Deltahat) p2 <-pp(tau2, n, delta, Deltahat) i <-abs(p1 - p2) >=0.0025& (pmin(p1, p2) <0.1|pmax(p1, p2) >0.9)data.table(n, delta, s1, s2, Deltahat, ad=abs(p1 - p2))[i]}taus <-c(1/8, 1/4, 1/2, 1, 2, 4, 8)sigmas <-1/ tausdels <-seq(-1, 1, by=0.5)w <-expand.grid(n =c(10, 15, seq(20, 400, by=10)),delta = dels,s1 = sigmas,s2 = sigmas)setDT(w)w <- w[s2 > s1]z <-lapply(1:nrow(w), function(i) w[i, adif(n, delta, s1, s2)])u <-rbindlist(z)u[, adc :=cut2(ad, c(0, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.1, 0.15, 0.2, 0.4))]g <-vector('list', length(dels))names(g) <-paste0('δ=', dels)for(del in dels) g[[paste0('δ=', del)]] <-ggplot(u[delta==del], aes(x=n, y=Deltahat, color=adc)) +geom_point(size=0.3) +facet_grid(s1 ~ s2) +guides(color=guide_legend(title='|Difference|')) +theme(legend.position='bottom') +ylab(expression(hat(Delta))) +labs(caption=bquote(delta == .(del)))maketabs(g, initblank=TRUE)

I want to make it quite clear that I have no objection with flexible designs where they help us avoid pursuing dead-ends. The analogous technique from the sequential design world is stopping for futility. In my opinion of all the techniques of sequential analysis, this is the most useful. — Senn (2013)

Compute PP as often as desired, even continuously

Formal Bayesian approach can also consider having adequate safety data

Stop when PP > 0.95 for any efficacy and PP > 0.95 for safety event hazard ratio ≤ 1.2

Spiegelhalter et al. (1993) argue for use of ‘range of equivalence’ in monitoring

Bayesian design is very simple if no cap on n

But often reach a point where clinically important efficacy unlikely to be demonstrated within a reasonable n

Futility is more easily defined with Bayes. 3 possible choices:

Stop when P(efficacy < ε) is > 0.9

Use posterior predictive distribution at planned max sample size (Berry (2006))

Pre-specify max n and continue until that n if trial not stopped early for efficacy or harm (more expensive)

Insightful approximation of Spiegelhalter et al. (1993)

The published report of any study attempts to answer the crucial question: What does this trial add to our knowledge? The strength of the Bayesian approach is that it allows one to express the answer formally. It therefore provides a rational framework for dealing with many crucial issues in trial design, monitoring and reporting. In particular, by making explicitly different prior opinions about efficacy, and differing demands on new therapies, it may shed light on the varying attitudes to balancing the clinical duty to provide the best treatment for individual patients, against the desire to provide sufficient evidence to convince the general body of clinical practice. — Spiegelhalter et al. (1993)

9.9 Example Operating Characteristics

Even with a disadvantageous mismatch between the prior used in the analysis and the prior used to generate the universe of efficacies, the design was able to stop early for evidence of non-trivial efficacy in 0.401 of 10,000 trials over a skeptical distribution of true efficacies, and was right 0.91 of the time when doing so. The design was right 0.98 of the time in terms of there being truly some efficacy when stopping for more than trivial efficacy. Stopping early for evidence of less than trivial efficacy was possible in 0.518 of the trials and was the correct decision 0.97 of the time. One quarter of such trials stopped with fewer than 14 participants studied and half had a sample size less than 32. Overall the average sample size at the time of stopping was 138 for the most skeptical prior, which is just over ½ what a traditional frequentist design would require (234) just to have power to detect the MCID, still not providing evidence for non-trivial efficacy, inefficacy, or similarity. With the Bayesian sequential design only 0.024 of the trials ended without a conclusion for the most skeptical prior.

For the Bayesian sequential design at the start of this chapter, simulate the operating characteristics for the following setting.

Two-arm parallel-group RCT with equal sample sizes per group

Response variable is normal with standard deviation 1.0

Estimand of interest is \(\Delta\) = the unknown difference in means

MCID is \(\delta=0.3\)

Minimum clinically detectable difference is \(\gamma=0.1\)

Similarity interval is \(\Delta \in [-0.15, 0.15]\)

Maximum affordable sample size is 750 per treatment group

Required minimum precision in estimating \(\Delta\): \(\pm 0.4\) (half-width of 0.95 credible interval)

\(\rightarrow\) minimum sample size \(N_{p}\) before stopping for efficacy

Allow stopping for inefficacy at any time, even if the treatment effect cannot be estimated after stopping, i.e., one only knows there is a high probability that the effect \(< \gamma\).

The first analysis takes place after the first participant has been enrolled and followed, then after each successive new participant’s data are complete. The first look for efficacy is delayed until the \(N_{p}^\text{th}\) participant has been followed so that the magnitude of the treatment effect has some hope of being estimated.

Prior for \(\Delta\) is normal with mean 0 and SD \(\sigma\) such that \(\Pr(\Delta > 0.75) = \kappa\) where \(\kappa\) is varied over \({0.2, 0.1, 0.05, 0.025}\)

This corresponds to \(\sigma=\) 0.8911, 0.5852, 0.456, 0.3827

These are skeptical priors, chosen because the treatment is assumed to be an incremental rather than a curative therapy

With these similarity limits a large number of studies were stopped early for similarity. In one test with a narrower similarity interval, 0.035 of simulated studies were stopped early for similarity.When using an ultra-high information continuous response variable as in this example, with the between-subject standard deviation known to be 1.0, the data having a normal distribution, and not delaying the first look, it is possible to make a decision with any one participant if she has a very impressive response. This could never happen with a binary endpoint or with unknown SD.

The assumed data model leads to a frequentist point estimate of \(\Delta\) = \(\hat{\Delta}\) = difference in two means, with variance \(\frac{2}{n}\), where \(n\) is the per-group sample size. The one-sided frequentist power for a two-tailed test with \(\alpha=0.05\) is \(1 - \Phi(1.96 - \frac{0.3}{\sqrt{\frac{2}{n}}})\), and setting this to 0.9 requires \(n=2[\frac{1.96 + 1.282}{0.3}]^2 = 234\). 1.282 is \(\Phi^{-1}(0.9)\).

To arrive at \(N_{p}\), the minimum sample size needed to achieve non-embarrassing precision in estimating \(\Delta\), keep in mind that skeptical priors for \(\Delta\) push the posterior distribution for \(\Delta\) towards zero, resulting in narrower posterior distributions. To be conservative in estimating \(N_{p}\) use a flat prior. The 0.90 credible interval is \(\hat{\Delta} \pm 1.645\sqrt{\frac{2}{n}}\). Setting the half-length of the interval to \(\delta=0.3\) results in \(n = N_{p} = 2(\frac{1.645}{0.3})^2 = 60\) per group.

Were a margin of error of 0.2 been required, \(N_{p}=135\), i.e., the first look for possible stoppage for efficacy would be at the \(135^\text{th}\) participant. Setting the required precision to 0.085 would result in no early stopping for efficacy, which is a choice that some trial designers may make, but it requires massive justification for the choice of the maximum sample size.

At a given interim look with a cumulative difference in means of \(\hat{\Delta}\) the posterior distribution of \(\Delta | \hat{\Delta}\) is normal with variance \(\frac{1}{\tau + \frac{n}{2}}\) and mean \(\frac{n\hat{\Delta}}{n + 2\tau}\) where \(\tau\) is the prior precision, \(\frac{1}{\sigma^2}\). Here is a function to compute the posterior probability that \(\Delta > \delta\) given the current difference in means \(\hat{\Delta}\).

Code

pp <-function(delta, Deltahat, n, kappa) { sigma <-psd(0.75, kappa) tau <-1./ sigma ^2 v <-1./ (tau + n /2.) m <- n * Deltahat / (n +2.* tau)1.-pnorm(delta, m, sqrt(v))}

A simple Bayesian power calculation for a single analysis after \(n\) participants are enrolled in each treatment group, and considering only evidence for any efficacy, sets the probability at 0.9 that the posterior probability of \(\Delta > 0\) exceeds 0.95 when \(\Delta=\delta\). The posterior probability PP that \(\Delta > 0\) is \(\Phi(\frac{n\hat{\Delta}/\sqrt{2}}{\sqrt{n + 2\tau}})\). Making use of \(\hat{\Delta} \sim n(\delta, \frac{2}{n})\), the probability that PP > 0.95 is \(1 - \Phi(\frac{za - \delta}{\sqrt{\frac{2}{n}}})\) where \(z = \Phi^{-1}(0.95)\) and \(a=\frac{\sqrt{2}\sqrt{n + 2\tau}}{n}\).

Code

# Special case when gamma=0, for checkingbpow <-function(n, delta=0.3, kappa=0.05) { tau <-1./psd(0.75, kappa) ^2 z <-qnorm(0.95) a <-sqrt(2) *sqrt(n +2* tau) / n1.-pnorm((z * a - delta) /sqrt(2/ n))}bpow <-function(n, delta=0.3, gamma=0, kappa=0.05, ppcut=0.95) { tau <-1./psd(0.75, kappa) ^2 z <-qnorm(ppcut) a <- n +2.* tau1.-pnorm(sqrt(n /2) * ((a / n) * (gamma + z *sqrt(2/ a) ) - delta))}findx <-function(x, y, ycrit, dec=0) round(approx(y, x, xout=ycrit)$y, dec)R$bayesN <-round(findx(10:500, bpow(10:500), 0.9))bpow(R$bayesN)

[1] 0.9006247

Setting the Bayesian power to 0.9 results in \(196\) when \(\kappa=0.05\). Contrast this with the frequentist sample size of 234.

To understand the stringency of adding the requirement that efficacy is non-trivial to the requirement of having a very high probability that any efficacy exists, consider a sample size (45) and current mean difference (0.3) where the probability of any efficacy is 0.9. Compute the probability of non-trivial efficacy with that mean difference. The prior probability of a very large (0.75) or larger treatment effect is taken as \(\kappa=0.05\).

Code

pp( 0, 0.3, 45, 0.05)

[1] 0.9017632

Code

pp(0.1, 0.3, 45, 0.05)

[1] 0.7790777

Code

pp( 0, 0.36, 45, 0.05)

[1] 0.9394288

Code

pp(0.1, 0.36, 45, 0.05)

[1] 0.8478874

At the point at which there is over 0.9 chance of any efficacy in the right direction, the probability of non-trivial efficacy is 0.78. To achieve a probability of 0.85 would require an observed mean difference of 0.36 which raises the probability of any efficacy to 0.94.

What if instead we wanted 0.9 Bayesian power to achieve a PP > 0.85 that \(\Delta > \gamma\), the threshold for a detectable treatment effect? Let’s compute over a range of \(n\) the two relevant probabilities: of getting \(\Pr(\Delta > 0) > 0.95\) and of getting \(\Pr(\Delta > 0.10) >0.85\), and also compute Bayesian power for \(\Pr(\Delta > 0.15) > 0.8\).

# Solve for n to get Bayesian power of 0.9p <- w[, .(n=findx(n, p, 0.9)), by=.(crit, Criterion)]navg <-if(length(R$stopn)) R$stopn[conclusion =='non-trivial efficacy'& kappa==0.001, Mean]

For a fixed sample-size study the necessary \(N\) to have a 0.9 chance that the posterior probability of more than trivial efficacy exceeds 0.85 is 282.

Later in this section the expected sample size to establishment of non-trivial efficacy will be estimated when continuous sequential analysis is allowed. This number is 122 for the most skeptical prior used. The more general expected sample size is that required on the average to reach any conclusion. This number is 138.

9.9.1 Decision Thresholds

Before simulating the trials, let’s compute the decision thresholds, i.e., the least impressive values of the observed difference in means \(\hat{\Delta}\) that would trigger a given conclusion, as a function of the current sample size. Let \(\hat{\Delta}\) be a regular sequence of values for each sample size. For the current purpose don’t delay any efficacy looks, and treat each of the three reasons for stopping independently. For inefficacy, the \(\hat{\Delta}\) threshold is the highest \(\hat{\Delta}\) leading to stopping for inefficacy. For non-trivial efficacy, this is the lowest \(\hat{\Delta}\) leading to stopping for non-trivial effectiveness. For similarity the threshold is an interval with the lower value being the minimum \(\hat{\Delta}\) meeting the similarity criterion and the upper limit is the maximum \(\hat{\Delta}\) meeting the criterion.

Do the calculations for one \(\kappa\).

Code

difs <-seq(-1.5, 1.5, by=0.001)ns <-1:750d <-expand.grid(dif=difs, n=ns)setDT(d)thresh <-function(Deltahat, n, kappa=0.05) {# For an ascendeing Deltahat vector and single value of n# Compute posterior probability of any efficacy p1 <-pp(0., Deltahat, n, kappa)# Compute probability of non-trivial efficacy p2 <-pp(0.1, Deltahat, n, kappa)# Compute probability of similarity p3 <-pp(-0.15, Deltahat, n, kappa) -pp( 0.15, Deltahat, n, kappa) ineff <-max(Deltahat[p2 <0.025]) eff <-min(Deltahat[p1 >0.95& p2 >0.85]) simlow <-min(Deltahat[p3 >0.95]) simhi <-max(Deltahat[p3 >0.95]) x <-c(ineff, eff, simlow, simhi) x[is.infinite(x)] <-NAlist(conclusion=c('inefficacy', 'non-trivial efficacy', 'similarity-low', 'similarity-high'),Deltahat=x)}w <- d[, thresh(dif, n, kappa=0.05), by=n]R$threshold <- ww[n %in%c(50, 100, 200, 750)]

ggplot(w, aes(x=n, y=Deltahat, col=conclusion)) +geom_line() +scale_x_continuous(minor_breaks=seq(0, 675, by=25)) +scale_y_continuous(breaks=seq(-1.5, 1.5, by=0.5),minor_breaks=seq(-1.5, 1.25, by=0.1)) +ylab(expression(hat(Delta))) +labs(title='Decision thresholds in terms of observed differences in means')

Code

sim.min <- w[conclusion =='similarity-low'&!is.na(Deltahat), min(n)]sim.max <- w[conclusion =='similarity-high'& n ==750, Deltahat]ineff100 <- w[conclusion =='inefficacy'& n ==100, Deltahat]ineff200 <- w[conclusion =='inefficacy'& n ==200, Deltahat]nte50 <- w[conclusion =='non-trivial efficacy'& n ==50, Deltahat]nte200 <- w[conclusion =='non-trivial efficacy'& n ==200, Deltahat]nte.min <- w[conclusion =='non-trivial efficacy'& n ==750, Deltahat]

A few observations are in order.

Similarity can’t be concluded until \(n > 332\)

Even then the observed difference needs to be almost exactly zero to make that conclusion

At the maximum sample size \(n=750\) similarity is concluded if the absolute value of the observed \(\hat{\Delta}\) is 0.066 or less (recall that similarity was defined as \(\Delta \in [-0.15, 0.15]\))

An inefficacy conclusion is possible if \(\hat{\Delta} < -0.181\) when \(n=100\) and if \(\hat{\Delta} < -0.096\) when \(n=200\)

A non-trivial efficacy conclusion is possible if \(\hat{\Delta} > 0.36\) when \(n=50\) or when \(\hat{\Delta} > 0.211\) when \(n=200\)

When \(\hat{\Delta} = 0.156\) then planned maximum sample size of \(n=750\) is required to be able to conclude non-trivial efficacy. Recall that the MCID is 0.3 and the threshold for trival effect is \(\Delta=0.1\).

Decision Thresholds on \(z\) Scale

Since the observed treatment effect \(\hat{\Delta}\) has variance \(\frac{2}{n}\) with \(n\) participants per group, the \(\hat{\Delta}\) thresholds are easily converted into the \(z\) scale \(\sqrt{\frac{n}{2}}\hat{\Delta}\).

Code

ggplot(w, aes(x=n, y=Deltahat *sqrt(n /2), col=conclusion)) +geom_line() +scale_x_continuous(minor_breaks=seq(0, 675, by=25)) +ylab(expression(z)) +labs(title='Decision thresholds on the z-scale')

The patterns are different from what is seen with a frequentist approach, as here we see that when \(n > 50\) the critical value for \(z\) increases. That is because we are demanding at least an 0.85 chance that the efficacy is non-trivial (\(\Delta > \gamma\)). The drop in the upper curve from the lowest \(n\) for which a conclusion of non-trivial efficacy is possible (\(n=7\)) to \(n=50\) is due to the influence of the skeptical prior for \(\Delta\) when \(n\) is small. When a flat analysis prior is used for \(\Delta\), the upturn disappears.

9.9.2 Simulation

Now we simulate 10,000 sequential trials for each \(\kappa\) to evaluate the Bayesian operating characteristics. \(\kappa\) and the true unknown treatment effect \(\Delta\) are varied. Studies are allowed to proceed to an upper limit sample size of 750 in each group. Each study is stopped when any stopping criterion is satisfied.

The first function simulates one complete trial, computing the cumulative summary statistic (difference in means) after 1, 2, …, 750 participants in each treatment group.

Code

sim <-function(Delta, kappa, N=750) {# Simulate raw data for whole planned trial, # for each treatment y1 <-rnorm(N, 0., 1.) y2 <-rnorm(N, Delta, 1.) j <-1: N ybar1 <-cumsum(y1) / j ybar2 <-cumsum(y2) / j Deltahat <- ybar2 - ybar1 Deltahat # cumulative mean difference}

The next function executes the stopping rules for one trial, looking sequentially.

Code

sim1 <-function(Delta, kappa, ignore=59) { Deltahat <-sim(Delta, kappa) n <-1:length(Deltahat)# Compute posterior probability of any efficacy p1 <-pp(0., Deltahat, n, kappa)# Compute probability of non-trivial efficacy p2 <-pp(0.10, Deltahat, n, kappa)# Compute probability of similarity p3 <-pp(-0.15, Deltahat, n, kappa) -pp( 0.15, Deltahat, n, kappa)# For each decision find first sample size at which criterion was satisfied k <- n > ignore ineff <-which((p2 <0.025))[1] eff <-which(k & (p1 >0.95& p2 >0.85))[1] sim <-which(k & (p3 >0.95))[1] u <-function(r, n) list(conclusion=r, n=n)# If not stopped, trial went to maximum All <-c(ineff, eff, sim)if(all(is.na(All))) return(u('none', max(n))) first <-min(All, na.rm=TRUE)if(!is.na(ineff) && ineff == first) return(u('inefficacy', ineff))if(!is.na(eff) && eff == first) return(u('non-trivial efficacy', eff))u('similarity', sim)}

Run all 50,000 trials. For each \(\Delta\) and \(\kappa\) simulate one clinical trial. Do this over a very fine grid of \(\Delta\) so that we can later smooth the individual results and do not need multiple trials for any one value of \(\Delta\). \(\Delta\)s are drawn from a mean zero normal distribution with \(\Pr(\Delta > 0.75) = 0.025\).

Even with the most skeptical prior of the four priors, more than half of the simulated studies which ultimately stopped for efficacy could be stopped with that conclusion after just 60 participants, the minimum required for precision. We also need to examine the stopping times counting the other reasons. Here are the mean and median number of participants when stopping is for any reason (including going to the maximum sample size of 750).

For planning purposes the expected sample size may be taken as 138. Sequential analysis without the traditional frequentist limitations on the number of looks makes a large difference in trial efficiency while providing evidence in more dimensions.

Now consider the most important aspect of the performance of the design: the primary Bayesian operating characteristic—correctness of decisions about treatment effect. For our simulation study we judge the alignment between the decision about the treatment effect and the (unknown to us and to the sequential analyses but known to the simulation code) true treatment effect that was in play when the data were simulated. So for each prior and each decision that was made, we estimate the distribution of true treatment effects that were in effect when combining all the simulated trials resulting in that decision. This is the distribution of the \(\Delta\) values conditional on stopping (or going to the sample size limit of 750 for the no-conclusion simulations) and on the reason for stopping. Keep in mind that for the simulations \(\Delta\)s were drawn from \(\kappa=0.025\).

Examine distributions of underlying true \(\Delta\) for each prior and each conclusion (reason for stopping).

For a conclusion of non-trivial efficacy, the true effect in play was rarely > 0.2. Let’s look at the actual proportions of true \(\Delta\)s in various intervals vs. the conclusion made.

The least favorable result is obtained when \(\kappa=0.2\) when the analysis prior is significantly less skeptical than the way the \(\Delta\)s were generated. To highlight two important points from that first part of the table, for trials that were stopped with a conclusion of non-trivial efficacy, 0.91 of such trials were truly generated by \(\Delta > 0.1\). But 0.98 had any efficacy (\(\Delta > 0\)). For trials stopped for inefficacy, 0.9 of them were generated by a harmful treatment. 0.97 of them were generated by a harmful or at most only trivially effective treatment. Inconclusive studies tended to have treatment effect near \(\Delta = 0.1\). Only about 0.024 of the simulated trials resulted in no conclusion when \(\kappa=0.001\), i.e., wasted resources by randomizing 750 participants to each treatment arm only to have no conclusion at all.

To better understand the “no conclusion” results here are the quartiles and 0.1 and 0.9 quantiles of true \(\Delta\)s in effect when trials went to the maximum sample size without stopping.

Code

g <-function(x) { z <-quantile(x, c(10, 25, 50, 75, 90)/100)as.list(round(z, 2))}d[conclusion =='none', g(Delta), by=kappa]

The median true \(\Delta\) when no conclusion could be reached was very near \(\gamma = 0.1\), the threshold for triviality of treatment effect. So \(\frac{1}{2}\) of the futile trials were generated by trivial efficacy. Very few if any of the trials were generated by a completely ineffective treatment (\(\Delta < 0\)).

Finally, examine the distribution of true \(\Delta\) values as a function of deciles of stopping time, when the conclusion was non-trivial efficacy.

Many studies stopped at the minimum of N=60 and these were reliable in judging the treatment to have more than trivial efficacy.

9.10 Borrowing Information

When the treatment has been well studied in prior randomized clinical trials it is reasonable to use at least some of that information in the prior distribution for the treatment effect in the new study

Some discounting is required because there may be doubt about

similarity of patients

general time trends in outcomes

changes in concomitant therapies that could detract from efficacy estimated in patients before such therapies were in use

changes in how outcomes are measured including duration of follow-up

Some ideal situations for borrowing of information include

use of a device on a bone that is not the bone previously studied, or use on skin on another part of the body from that studied

studying a device that is a tweak of a device that has been extensively studied previously

use of a drug on a different patient population where the difference in patients is not directly related to what the drug is intended to prevent

use of adult data about a drug to inform efficacy for treating the same disease in children

The posterior probability distribution of the treatment effect from the earlier study/studies becomes part of the prior

Important to include results from randomized studies that showed unimpressive treatment effects along with those that were “positive”; no cherry picking allowed

Discounting can be done a variety of ways, including power priors and mixtures of priors

Mixtures of priors are appealing and more interpretable

Mix a prior \(f_1\) that is somewhat skeptical of large treatment effects from the new data alone with an optimistic prior \(f_2\) from previous studies

Prior for the new study: \(f(\theta) = (1 - a) f_{1}(\theta) + a f_{2}(\theta)\)

\(a\) may be directly interpreted as the applicability of the previous data to the current comparison

Applicability can be thought of in two ways (see below); for purpose of Bayesian analysis which of the two you prefer does not matter

For either notion of applicability, uncertainties (width of posterior distribution from previous studies) must be carried forward

Applicability is a number between 0 and 1 and is either

The degree to which you think that relative efficacy of treatment as learned from previous trials applies as the relative efficacy in the current trial or

The probability that what is learned about relative efficacy in previous studies fully applies to the new study

To anchor the values:

\(a=0 \rightarrow\) previous data should be completely ignored

\(a=1 \rightarrow\) previous data are completely relevant and the new study is unnecessary

To justify a value of \(a\) one should undertake a survey on a reasonable number (say 12-20) of experts including, for drug studies, clinical pharmacologists familiar with absorption and metabolism of the drug in different patient types

See this for excellent work the FDA has done related to borrowing adult data for pediatric drug trials

But it is important to note that strong arguments can be made for including the raw data from other studies in a joint analysis with the new trial data, instead of summarizing the other data through a posterior distribution that is converted to a prior and then down-weighted. Down-weighting can come from adding a bias parameter for the historical data when forming the joint new data/old data Bayesian models. Major advantages of this approach include

using a standard simple prior (e.g., Gaussian) for discounting

making assumptions more explicit

explicitly modeling bias

getting more accurate analysis, for example by not assuming normality for a point estimate from historical data

covariate adjustment

This last point is all-important. If there is a shift in the distribution of important prognostic factors between two data sources, this shift can be handled elegantly and simply using standard covariate adjustment. This is far more effective and efficient than matching or playing with inclusion criteria. Without analyzing the raw data, the effect of covariate shifts is very hard to counter, especially when the outcome model is nonlinear. It is often unclear what part of the covariate distribution (e.g., shift in means vs. a shift in only one of the tails) should be accounted for when using summary and not raw data.

A simple case study in joint raw data modeling for incorporating historical control data into an RCT may be found at fharrell.com/post/hxcontrol. The joint analysis is facilitated by the Bayesian approach, which allows one to specify any number of models for joint analysis. When the models share one or more parameters, interesting things happen. The case study illustrates that

When the new RCT has some but not enough control patients, and control information from historical data is incorporated into the analysis, the prior distribution for the bias in the historical data is important. The amount of bias can be estimated.

When the new RCT has no controls, so that control information is coming only from the historical data, the prior distribution for the bias in the historical data is incredibly important. The amount of bias cannot be estimated from the data at hand, and comes solely from the bias parameter’s prior distribution. This truly reflects the limitation of not having concurrent controls.

No restriction on \(\beta_{3} \rightarrow\) new study stands on its own

Skeptical prior on \(\beta_{3} \rightarrow\) borrow information

Can also add a nonlinear interaction with a more skeptical prior

fharrell.com/post/ia quantifies how non-overlap of \(X\) distributions relates to the precision of \(X\)-specific treatment effects.

Adcock, C. J. (1997). Sample size determination: A review. The Statistician, 46, 261–283.

Berry, D. A. (2006). Bayesian clinical trials. Nat Rev, 5, 27–36.

excellent review of Bayesian approaches in clinical trials; "The greatest virtue of the traditional approach may be its extreme rigour and narrowness of focus to the experiment at hand, but a side effect of this virtue is inflexibility, which in turn limits innovation in the design and analysis of clinical trials. ... The set of “other possible results” depends on the experimental design. ... Everything that is known is taken as given and all probabilities are calculated conditionally on known values. ... in contrast to the frequentist approach, only the probabilities of the observed results matter. ... The continuous learning that is possible in the Bayesian approach enables investigators to modify trials in midcourse. ... it is possible to learn from small samples, depending on the results, ... it is possible to adapt to what is learned to enable better treatment of patients. ... subjectivity in prior distributions is explicit and open to examination (and critique) by all. ... The Bayesian approach has several advantages in drug development. One is the process of updating knowledge gradually rather than restricting revisions in study design to large, discrete steps measured in trials or phases."

Braun, T. M. (n.d.). Motivating sample sizes in adaptive Phase I trials via Bayesian posterior credible intervals. Biom, n/a. https://doi.org/10.1111/biom.12872

Grouin, J.-M., Coste, M., Bunouf, P., & Lecoutre, B. (2007). Bayesian sample size determination in non-sequential clinical trials: Statistical aspects and some regulatory considerations. Stat Med, 26, 4914–4924.

Joseph, L., & Bélisle, P. (1997). Bayesian sample size determination for normal means and differences between normal means. The Statistician, 46, 209–226.

Kunzmann, K., Grayling, M. J., Lee, K. M., Robertson, D. S., Rufibach, K., & Wason, J. M. S. (2021). A Review of Bayesian Perspectives on Sample Size Derivation for Confirmatory Trials. The American Statistician, 0(0), 1–9. https://doi.org/10.1080/00031305.2021.1901782

Pezeshk, H., & Gittins, J. (2002). A fully Bayesian approach to calculating sample sizes for clinical trials with binary reponses. Drug Info J, 36, 143–150.

Rigat, F. (2023). A conservative approach to leveraging external evidence for effective clinical trial design. Pharmaceutical Statistics, pst.2339. https://doi.org/10.1002/pst.2339

Includes some sample size considerations to ensure that the prior is not too impactful

"Every time the statistician working in the pharmaceutical industry does a sample size determination for a trial using a responder analysis, he or she should do the same calculation using the original measure. If the dichotomy is preferred, an explanation as to why the extra millions are going to be spent should be provided."

Simon, R., & Freedman, L. S. (1997). Bayesian design and analysis of two two factorial clinical trials. Biometrics, 53, 456–464.

Spiegelhalter, D. J., & Freedman, L. S. (1986). A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Stat Med, 5, 1–13. https://doi.org/10.1002/sim.4780050103

Spiegelhalter, D. J., Freedman, L. S., & Parmar, M. K. B. (1993). Applying Bayesian ideas in drug development and clinical trials. Stat Med, 12, 1501–1511. https://doi.org/10.1002/sim.4780121516

Wang, H., Chow, S.-C., & Chen, M. (2005). A Bayesian Approach on Sample Size Calculation for Comparing Means. J Biopharm Stat, 15(5), 799–807. https://doi.org/10.1081/bip-200067789

analytic form for posterior for normal t-test case

Whitehead, J., Cleary, F., & Turner, A. (2015). Bayesian sample sizes for exploratory clinical trials comparing multiple experimental treatments with a control. Stat Med, 34(12), 2048–2061. https://doi.org/10.1002/sim.6469

```{r echo=FALSE}require(Hmisc)require(ggplot2)require(qreport)Rd <- if(file.exists('design.rds')) readRDS('design.rds')```# Bayesian Clinical Trial Design {#sec-design}```{mermaid}%%| column: screen-insetflowchart LRD[Study Design] --> G[Desired Conclusions] & UDG --> Des[Efficacy<br>Non-trivial efficacy<br>Inefficacy<br>Harm<br>Non-inferiority<br>Similarity<br>Estimation]UD[Avoid] --> NO[No conclusion at<br>maximum N]D --> OC[Desired Operating<br>Characteristics]OC --> C & P & ST & PrecC[Correctness<br>of conclusions]P[High probability<br>of detecting effect=MCID at<br>the average sample size]ST[Expected stopping time]Prec[Adequate precision of estimated effect<br>if evidence for effectiveness]D --> Ch[Challenges] --> chl[Understanding that correctness of the conclusion is the only relevant OC for post-data decision making<br>Not allowing gaming of effect size used to choose N<br>Willingness to stop very early for futility<br>Whether to stop early for similarity<br>Formalizing minimum precision criterion<br>Specificity and objectivity about triviality of treatment effects]```**Example**```{mermaid}%%| column: screen-insetflowchart TDN["N<br>Frequentist: 234 for 0.9 power<br>Bayesian 0.9 for Δ > 0: 196<br>Bayesian 0.9 for Δ > γ: 282<br>Bayesian sequential N avg: 113<br>Bayesian Np=60 min N for precision<br>Max N: 750"]BS[Bayesian Simulation]BS --> SM[Different effect Δ for each trial]SM --> G[Bayesian goal:<br>Uncover Δ generating the data]G --> P[Universe of Δ<br>Chosen to be disadvantageous] --> NT[10,000 Trials] NT --> S[Sequential assessment<br>unlimited looks]S --> SI[Stop at any time<br>for inefficacy]S --> SS[Stop at any time<br> for similarity]S --> NON[No conclusion]S --> SE["Stop any time N≥Np for<br>non-trivial efficacy"]SI --> SIN[5184 Stopped early<br>Avg N: 62] --> SIC["5020 correct decision for Δ < γ (0.97)"]SS --> SSN[634 Stopped early<br>Avg N: 423] --> SSC["607 correct decision (0.96)"]SE --> SEN[4010 Stopped early<br>Avg N: 102] --> SEC["3643 correct decision for Δ > γ (0.91)<br>3913 correct decision for Δ > 0 (0.98)"]NON --> NONN[172 went to N=750] --> MD[Median Δ = γ]```Source: `hbiostat.org/bayes/bet/design`::: {.quoteit}Asking one to compute type I assertion probability α for a Bayesian design is like asking a poker player winning > $10M/year to justify his ranking by how often he places bets in games he didn't win.<br><br>Do you want the probability of a positive result **when** there is nothing? Or the probability that a positive result **turns out to be** nothing?:::Details in @sec-design-eop## Goals of a Therapeutic TrialThere are many goals of comparing treatments, including quantifying evidence for* any efficacy* non-trivial efficacy* inefficacy* harm* similarity* non-inferiorityand sometimes the goal is just to estimate the treatment effect, whatever it is. Bayesian methods provide probabilities that each of the above conclusions is correct, given the data, the data model, and the prior distribution for the treatment effect. For estimation, Bayes provides a complete probability distribution for the unknown effect, and corresponding interval estimates (highest posterior density intervals, credible intervals). With the enormous number of traditional frequentist clinical trials that reach their planned fixed sample size but have equivocal results, one particular advantage of Bayesian sequential designs is all-important: minimizing the chance of getting to the maximum sample size with no conclusion about efficacy. [Also, the Bayesian approach prevents the all-too-common [absence of evidence is not evidence of absence](https://www.bmj.com/content/311/7003/485) error arising from attempting to draw a conclusion from a large P-value.]{.aside}A major advantage of Bayes is that it provides evidence in favor of assertions, not just evidence against a hypothesis. That explains why Bayesian designs and analyses are the same (other than the assertion whose probability is being computed) for all study goals. Only the required sample size may differ by the trial's intent.Another major advantage of Bayes is that since it involves probabilities about unknowns given the current data, it does not have multiplicity problems with sequential evidence assessment (@sec-seq). Evidence from current data are not down-weighted because the data were examined previously or because they will be examined in the future. Bayesian trial design need not have a sample size, and if it does, there is nothing wrong with exceeding that sample size when results are equivocal. This implies that Bayesian designs are less likely to result in a trial having an equivocal result. And the more often one analyzes the data the lower the expected sample size. One can stop as soon as there is a pre-specified amount of evidence for one of the conclusions listed above. When sample size N is discussed below, it can refer to a planned maximum sample size, an average sample size, the 75$^\text{th}$ percentile of a distribution of sample sizes, etc. [Evidence from current data is the point of emphasis. Correctness of a decision is the probability that a decision made using the current data is correct. In frequentism, multiplicity comes from multiple chances to observe extreme data, which grows as the number of data looks increases. Bayesian posterior probabilities are about efficacy, not about data.]{.aside}With a Bayesian sequential design it is easy to add a requirement that the treatment effect be estimated with a specified precision. Often this precision (e.g., half-width of a credible interval) can be estimated before the trial starts. Then the precision-required sample size can define the time of the first data look.## Challenges* Change conceptions about operating characteristics + Primary OC must be the correctness of whatever decision is made* Be willing to stop a study very early when there is a very low probability of non-trivial efficacy* Whether to stop a study early for a conclusion of similarity of efficacy of the two treatments* Abandon need for estimating treatment effect when there is ample evidence for inefficacy* Be objective and formal about interpretation of effects by specifying a threshold for non-trivial effect* Removing cost from consideration of effect sizes to detect in favor of finding an outcome variable that will have rich enough information to achieve power with clinically-chosen effect size::: {.callout-note collapse="true"}## * Regulators and sponsors must recognize that the operating characteristic that needs to be optimized is the correctness of whatever decision is made about treatment effectiveness.* Sponsors must be willing to stop studies very early when there is strong evidence of inefficacy. This will free up patients with rare diseases for other trials, and free up resources likewise.* To be able to stop studies very early so as to not waste resources when the probability is very high that there is no non-trivial efficacy, sponsors must jettison the need to be able to estimate how ineffective a treatment is.* At present, consideration of whether a nonzero treatment effect is trivial is highly subjective and informal. To make optimal and defensible decisions, this must be formalized by specifying a threshold for non-triviality of effect. Importantly, this threshold is smaller than the MCID.* The current practice of choosing a more than clinically important effect size to use in power calculations has been called "sample size samba" and has resulted in a scandalous level of overly optimistically designed trials, commonly having an equivocal result. Many equivocal trials observed an effect that was merely clinically significant. Powering the study to only detect a "super clinically significant" effect results in low actual power and wasted resources. Objectively choosing effect sizes without cost considerations, then abandoning studies that cannot find an acceptable outcome variable that has sufficient resolution to achieve adequate power with a "real" effect size will prevent a large amount of waste and make clinical trials more ethical while freeing up patients with rare diseases to participate in other trials.:::## Bayesian Operating Characteristics::: {.quoteit}The primary Bayesian operating characteristic is how often a conclusion is correct. The primary frequentist operating characteristic is how often one reaches a conclusion.::: {.callout-note collapse="true"}## Bayesian inference allows one to make both positive ("the drug works") and negative ("the drug doesn't work") conclusions. Evaluation of Bayesian operating characteristics can focus on either the probability that whatever decision is made is correct, or the correctness of a single decision such as a drug works. Frequentism based on null hypothesis testing and P-values only brings evidence _against_ a hypothesis, so the only possible actions are (1) to reject $H_0$ (the one actual conclusion a frequentist approach can make), (2) get more data, or (3) stop with no conclusion except a failure to reject $H_0$. The operating characteristic given greatest emphasis by regulators has been type I assertion probability $\alpha$, the probability of rejecting $H_0$ when it is magically known to be true.::::::When designs are being formulated and when N is being selected, researchers need to know that the typical way the trial is likely to unfold is acceptable. The performance of a proposed Bayesian design must be judged by its Bayesian operating characteristics (OCs). Bayes is about prospective analysis and interpretation, i.e., trying to uncover the unknown based on what is known. It minimizes the use of unobservables in acting in forward information-flow predictive mode. Frequentist OCs deal with _what may happen_ under unknowable assumptions about the treatment effect generating the data. Bayesian OCs are the only OCs important when running Bayesian analyses, interpreting the results, and knowing how much to trust them. A **key point** is that accuracy of conclusions pertains to what has been learned from data, not what data may arise under certain unknowable assumptions about effects. Accuracy is a post-data consideration.Bayesian OCs are of four types. First there is _assurance_, which is the ability of the design to detect a real treatment effect were it to exist. This is Bayesian power (the chance of achieving a high posterior probability of efficacy). Second is precision of post-data knowledge about the treatment effect, e.g., half of the width of an uncertainty interval. Third is the expecting stopping time, which is related to the efficiency of the design and the expected cost of the study. Fourth is the all-important judgment of the correctness of the decision at the moment the decision is made. Bayesian approaches optimize the chance of conclusions being correct if the following three conditions hold: [The first two conditions are in common with traditional frequentist approaches.]{.aside}* the variables being analyzed are measured unbiasedly* the data model is consistent with the experimental design and is reasonably well-fitting* the prior used in the analysis is not in conflict with the prior used by the judge of the accuracy of the study resultsQuantifying correctness of conclusions for a Bayesian design is equivalent to assessing predictive accuracy, which involves examining what's real as a function of predicted values. Correctness of conclusions (predictions) is easy to check in a simulation study, as explained in detail in @sec-design-eop. The process involves looking back to the treatment effect that generated the current trial data after the conclusion about the treatment effect has been made. After a large number of trials are simulated, the subset of trials reaching a certain conclusion is chosen, and for that subset the proportion of trials for which the true treatment effect is consistent with the conclusion is computed. [When Bayes is used to the fullest, decisions take consequences into account through a utility, loss, or cost function. The optimum Bayes decision that maximizes expected utility is a function of the posterior distribution of the treatment effect and the utility function (decision consequences). But different stakeholders have different utility functions, so we are usually content to deal only with the posterior distribution, a major input into the optimum decision.]{.aside}In actual studies most of the attention about this most important OC is focused on the choice of the prior distribution. When the prior is acceptable, the Bayesian procedure will result in optimum decisions in the eye of the beholder of the prior, by definition. Sensitivity of decisions to the choice of a prior would be a useful assessment at the design stage.In @sec-seq Bayesian sequential analysis is studied in more detail, though the simulations there are simple from the standpoint of having only one reason for stopping. In that situation, one stops, for example, when a posterior probability PP exceeds 0.95. When the simulation is done using the same prior as the prior used in the analysis, if the PP is 0.96 upon stopping, the true probability of a treatment effect will be exactly 0.96. Trusting the Bayesian results at the decision point requires the PP being well calibrated. This happens when there is no major conflict between the design prior and the analysis prior. [A simple [play-by-play analysis of football games](https://fharrell.com/post/nfl) and a simple [simulation of a sequential clinical trial](https://fharrell.com/post/bayes-seq) illustrate these points.]{.aside}## Example: Two-Treatment Parallel-Group Efficacy Design**Goals*** Maximize the probability of making the right decision, with 4 decisions possible (efficacy, similarity, inefficacy, not possible to draw any conclusion with available resources)* Minimize the probability of an inconclusive result at study end* Minimize the expected sample size* Stop as early as possible if there is a very high probability that the treatment's effect is trivial or worse, even if the magnitude of the effect cannot be estimated with any precision* If stopping early for non-trivial efficacy, make sure that the sample size provides at least crude precision of the effect estimate + Do this by not making the first efficacy look until the lowest sample size for which at least crude precision of the treatment effect is attained* Avoid gaming the effect size used in a power calculation* Recognize that traditional sample sizes are arbitrary and require too many assumptions* Recognize that what appears to be a study extension to one person will appear to be an interim analysis to another who doesn't need to do a sample size calculation* Build clinical significance into the quantification of evidence for efficacy**Definitions*** $\Delta$: true treatment effect being estimated* $\delta$: minimum clinically important $\Delta$* $\gamma$: minimum detectable $\Delta$ (threshold for non-trivial treatment effect, e.g. $\frac{1}{3}\delta$)* SI: similarity interval for $\Delta$, e.g., $[-\frac{1}{2}\delta, \frac{1}{2}\delta]$* $N$: average sample size, used for initial resource planning. This is an estimate of the ultimate sample size based on assuming $\Delta = \delta$. $N$ is computed to achieve a Bayesian power of 0.9 for efficacy, i.e., achieving a 0.9 probability at a fixed sample size that the probability of any efficacy exceeds 0.95 while the probability of non-trivial efficacy exceeds 0.85, i.e., $\Pr(\Delta > 0) > 0.95$ and $\Pr(\Delta > \gamma) > 0.85$.* More than one $N$ can be computed, e.g., $N$ such that if $\Delta = \gamma$ there is ever a high probability of stopping for inefficacy* $N_p$: minimum sample size for first efficacy assessment, based on required minimum precision of estimate of $\Delta$* Probability cutoffs are examples* This design does not require a separate futility analysis, as inefficacy subsumes futility* Below, "higher resolution Y" can come from breaking ties in the outcome variable (making it more continuous), adding longitudinal assessments, or extending follow-up```{mermaid}%%| column: screen-insetflowchart TDA[Choose response variable Y with<br>maximum resolution that is<br>relevant to the patient] --> B[Elicit effect sizes δ, γ for Y from experts and patients<br>who are unaware of budget limits] --> D[Compute average sample size N<br>and minimum efficacy sample size Np] --> E[Is N feasible?]E --> |No| G[Can a higher resolution Y be developed?] --> |No| I[Abandon Study]G --> |Yes| BE --> |Yes| L[Begin study<br>Do frequent sequential analysis] --> N["Pr(Δ > γ) < 0.025?"]N --> |Yes| P[Stop for inefficacy<br>Low chance of non-trivial effect]N --> |No| Np["Have ≥ Np participants finished<br>their first follow-up?"]Np --> |No| Y Np --> |Yes| R["Pr(Δ > 0) > 0.95 and Pr(Δ > γ) > 0.85"] --> |Yes| T[Stop for non-trivial efficacy]R --> |No| R2["Pr(Δ in SI) > 0.95"] --> |Yes| R3[Stop for similarity]R2 --> |No| V2[Resources exhausted?] --> |No| Y[Accrue more participants] --> NV2 --> |Yes| Z[Stop with inability<br>to reach any<br>conclusions] ```Example operating characteristics for this design are simulated in @sec-design-eop.------ * Bayesian inference opens up a wide variety of adaptive trials + stat methods need no adaptations + only sample size and Bayesian power calculations are complicated* But here consider non-adaptive trials with possible sequential monitoring## Sample Size Estimation* In a sense more honest than frequentist b/c uncertainty admitted + SDs + effect size* @whi15bay: sample size calculation for control vs. set of active tx + goal: high PP that ≥ 1 tx better than control or + high PP that none is better* Other good references: @kun21rev, @spi93app, @gro07bay, @spi86pre, @adc97sam, @jos97baya, @pez02ful, @sim97bay, @wan05bay, @bra18mot, [this](https://arxiv.org/pdf/2006.15715.pdf)* @spi93app has clear rationale for _unconditional power_ integrating over prior + "realistic assessment of the predictive probability of obtaining a 'significant' result" + discusses damage caused by using fixed optimistic treatment effect* Uebersax has an excellent article on [Bayesian unconditional power analysis](https://john-uebersax.com/stat/bpower.htm) which also discusses the hazards of choosing a single effect size for power estimation## Bayesian Power and Sample Size Examples {#sec-design-power}* Example notion of power: probability that PP will exceed a certain value such as 0.95 (Bayesian assurance)* Compute as a function of the unknown effect, see which true effects are detectable for a given sample size* If the minimum clinically important difference (MCID) is known, compute the success probability at this MCID* Better: take uncertainty into account by [sampling the MCID from an uncertainty (prior) distribution for it](https://www.linkedin.com/pulse/bayesian-power-assurance-probability-success-quick-whos-monseur), using a different value of MCID for each simulated trial. The Bayesian power (probability of success) from combining all the simulations averages over the MCID uncertainties. This is demonstrated in a time-to-event power calculation below.### One-Sample Mean Example* Consider earlier one-sample mean problem* Data are normal with variance 1.0 and the prior is normal with mean zero and variance $\frac{1}{\tau}$* Take $\mu > 0$ to indicate efficacy* PP $= P(\mu > 0 | \textrm{data}) = \Phi(\frac{n \overline{Y}}{\sqrt{n + \tau}})$* $\overline{Y}$ is normally distributed with mean $\mu$ and variance $\frac{1}{n}$* Long-run average of $\overline{Y}$ is $\mu \rightarrow$ long-run median of $\overline{Y}$ is also $\mu \rightarrow$ median PP $= \Phi(\frac{n \mu}{\sqrt{n + \tau}})$* Let $z = \Phi^{-1}(0.95)$* Chance that PP exceeds 0.95 is$P(\Phi(\frac{n \overline{Y}}{\sqrt{n + \tau}}) > 0.95) =P(\frac{n \overline{Y}}{\sqrt{n + \tau}} > z) =\Phi(\frac{n \mu - z \sqrt{n + \tau}}{\sqrt{n}})$* For varying $n$ and $\tau$ the median PP and the Bayesian power are shown in the figure```{r bpower}#| fig.scap: "Bayesian power and median posterior probability"#| fig.cap: "Bayesian power and median posterior probability for varying true effect values, sample sizes, and degrees of skepticism"#| fig.width: 8#| fig.height: 7d <- expand.grid(mu=seq(-2, 2, length=200), n=c(1, 2, 5, 10), tau=c(0, 2, 10), what=1:2)d <- transform(d, p = ifelse(what == 1, pnorm(n * mu / sqrt(n + tau)), pnorm((n * mu - qnorm(0.95) * sqrt(n + tau)) / sqrt(n))), n = factor(n), tau = factor(tau))d$Probability <- factor(d$what)levels(d$n) <- paste0('n==', levels(d$n))levels(d$tau) <- paste0('tau==', levels(d$tau))ggplot(d, aes(x=mu, y=p, col=Probability)) + geom_line() + scale_color_discrete(labels=expression(Median~P(mu > 0), P(P(mu > 0) > 0.95))) + facet_grid(tau ~ n, labeller=label_parsed) + xlab(expression(mu)) + ylab('Probability')```### Sample Size for Comparing Time to Event {#sec-design-cpower}In this example we compute the sample size needed to achieve adequate Bayesian power to achieve evidence for various assertions. The treatment effect is summarized by a hazard ratio (HR) in a proportional hazards (PH) model. Assertions for which the study needs to have the ability to provide conclusive evidence include* any reduction of hazard of an event (HR < 1)* HR < 0.98, 0.96, 0.94, ...* harm or trivial benefit (HR > 0.95)The trial's actual analysis might use a flexible parametric PH model implemented in the [`rstanarm` Bayesian survival analysis package](https://arxiv.org/pdf/2002.09633.pdf). Bayesian operating characteristics are simulated here using the Cox PH model.Take Bayesian power to be the probability that the posterior probability will exceed 0.975 when the true HR is a constant^[Ideally one would consider power over a distribution of HRs as done below.]. For the first two types of assertions, set the HR generating the data to 0.75 (assumed minimally clinically important difference), and for the third type of assertion set HR to 1.25.The actual analysis to be used for the clinical trial should adjust for covariates, but covariates are ignored in our simulations. Bayesian power will increase with the addition of covariates.If the treatment groups are designated A (standard of care) and B (new treatment), suppose that the following Weibull survival distribution has been found to fit patient's experiences under control treatment.$$S(t) = e^{- 0.9357 t^{0.5224}}$$The survival curve for treatment B patients is obtained by multiplying 0.9357 by the B:A hazard ratio (raising $S(t)$ to the HR power).Assume further that follow-up lasts 3y, 0.85 of patients will be followed a full 3y, and 0.15 of patients will be followed 1.5y when the final analysis is done^[The R `Hmisc` package [`spower` function](https://www.rdocumentation.org/packages/Hmisc/versions/4.6-0/topics/spower) will simulate much more exotic situations.].To compute Bayesian power, use the normal approximation to the Cox partial likelihood function, i.e., assume that log(HR) is normal and use standard maximum likelihood methods to estimate the mean and standard deviation of the log(HR). The Cox log-likelihood is known to be very quadratic in the log HR, so the normal approximation will be very accurate. For a prior distribution assume a very small probability of 0.025 that HR < 0.25 and a probability of 0.025 that it exceeds 4. The standard deviation for log(HR) that accomplishes this is $\frac{\log(4)}{1.96} = 0.7073 = \sigma$.With a normal approximation for the data likelihood and a normal prior distribution for log(HR), the posterior distribution is also normal. Let $\theta$ be the true log(HR) and $\hat{\theta}$ be the maximum likelihood estimate of $\theta$ in a given simulated clinical trial. Let $s$ be the standard error of $\hat{\theta}$^[Note that in a proportional hazards model such as the one we're using, the variance of $\hat{\theta}$ is very close to the sum of the reciprocals of the event frequencies in the two groups.].Let $\tau = \frac{1}{\sigma^{2}} = 1.9989$ and let $\omega = \frac{1}{s ^ {2}}$.Then the approximate posterior distribution of $\theta$ given the data and the prior is then normal with mean $\mu$ and variance $v$ where $v = \frac{1}{\tau + \omega}$ and $\mu = \frac{\omega \hat{\theta}}{\tau + \omega}$.The posterior probability that the true HR < $h$ is the probability that $\theta < \log(h)$, which is $\Phi(\frac{\log(h) - \mu}{\sqrt{v}})$.#### SimulationFirst consider code for generating outcomes for one clinical trial of size $n$ per group, follow-up time cutoff $u$, secondary cutoff time $u2$ with probability $pu2$, and true hazard ratio $H$. The result is posterior mean and variance and the P-value from a frequentist likelihood ratio $\chi^2$ test with the Cox PH model.```{r}require(data.table)require(ggplot2)require(rms)require(survival)sim1 <-function(n, u, H, pu2=0, u2=u, rmedian=NA) {# n: number of pts in each group# u: censoring time# H: hazard ratio# pu2: probability of censoring at u2 < u# rmedian: ratio of two medians and triggers use of log-normal model# Generate failure and censoring times for control patients# Parameters of a log-normal distribution that resembles the Weibull# distribution are below (these apply if rmedian is given) mu <--0.2438178 sigma <-1.5119236 ta <-if(is.na(rmedian)) (-log(runif(n)) /0.9357) ^ (1.0/0.5224)elseexp(rnorm(n, mu, sigma)) ct <-pmin(u, ifelse(runif(n) < pu2, u2, u)) ea <-ifelse(ta <= ct, 1, 0) ta <-pmin(ta, ct) # apply censoring# Generate failure and censoring times for intervention patients tb <-if(is.na(rmedian)) (-log(runif(n)) / (H *0.9357)) ^ (1.0/0.5224)elseexp(rnorm(n, mu +log(rmedian), sigma)) ct <-pmin(u, ifelse(runif(n) < pu2, u2, u)) eb <-ifelse(tb <= ct, 1, 0) tb <-pmin(tb, ct) ti <-c(ta, tb) ev <-c(ea, eb) tx <-c(rep(0, n), rep(1, n)) f <-if(is.na(rmedian)) coxph(Surv(ti, ev) ~ tx) elsesurvreg(Surv(ti, ev) ~ tx, dist='lognormal') chi <-2*diff(f$loglik) theta.hat <-coef(f)[length(coef(f))] s2 <-vcov(f)['tx', 'tx'] tau <-1.9989 omega <-1.0/ s2 v <-1.0/ (tau + omega) mu <- omega * theta.hat / (tau + omega)if(!all(c(length(mu), length(v), length(chi))) ==1) stop("E")list(mu =as.vector(mu),v =as.vector(v),pval =as.vector(1.0-pchisq(chi, 1)))}```Simulate all statistics for $n=50, 51, ..., 1000$ and HR=0.75 and 1.25, with 10 replications per $n$/HR combination.```{r wsim}HR <- 0.75ns <- seq(50, 1000, by=1)R <- expand.grid(n = ns, HR=c(0.75, 1.25), rep=1 : 10)setDT(R) # convert to data.tableg <- function() { set.seed(7) R[, sim1(n, 3, HR, 0.15, 1.5), by=.(n, HR, rep)] }S <- runifChanged(g, R, sim1, file='wsim.rds')```The number of trials simulated for each of the two HR's was `r nrow(S) / 2`.Before simulating Bayesian power, simulate frequentist power for the Cox PH likelihood ratio $\chi^2$ test with two-sided $\alpha=0.05$. Effect size to detect is HR=0.75.```{r}R <- S[HR ==0.75, .(reject =1* (pval <0.05)), by=n]# Compute moving proportion of trials with p < 0.05# Use overlapping windows on n and also logistic regression with# a restricted cubic spline in n with 5 knotsm <-movStats(reject ~ n, eps=100, lrm=TRUE,data=R, melt=TRUE)ggplot(m,aes(x=n, y=reject, linetype=Type)) +geom_line() +geom_hline(yintercept=0.9, alpha=0.3) +scale_x_continuous(breaks=seq(0, 1000, by=100)) +scale_y_continuous(breaks=seq(0, 1, by=0.1),minor_breaks=seq(0, 1, by=0.05)) +ylab('Pr(reject H0:HR=1)')```Use linear interpolation to solve for the sample size needed to have 0.9 frequentist power.```{r}lookup <-function(n, p, val=0.9)round(approx(p, n, xout=val)$y)m[Type =='LR', lookup(n, reject)]```For each trial simulated with HR=0.75, compute posterior probabilities that the true HR $< h$ for various HR cutoffs $h$. Both moving proportions and logistic regression with a restricted cubic spline in $n$ are used to smooth the results to estimate the relationship between $n$ and Bayesian power, i.e., the probability that the posterior probability reaches 0.975 or higher.```{r}R <- S[HR ==0.75]hs <-seq(0.8, 1.0, by =0.02)pp <-function(mu, v) list(h=hs, p=pnorm((log(hs) - mu) /sqrt(v)))R <- R[, pp(mu, v), by=.(n, rep)]# Compute moving proportion with posterior prob > 0.975# with overlapping windows on n, and separately by hR[, phi :=1* (p >=0.975)]m <-movStats(phi ~ n + h, eps=100, lrm=TRUE, lrm_args=list(maxit=30),data=R, melt=TRUE)ggplot(m,aes(x=n, y=phi, color=h, linetype=Type)) +geom_line() +geom_hline(yintercept=0.9, alpha=0.3) +scale_x_continuous(breaks=seq(0, 1000, by=100)) +scale_y_continuous(breaks=seq(0, 1, by=0.1),minor_breaks=seq(0, 1, by=0.05)) +ylab('Pr(Posterior Pr(HR < h) > 0.975)')```Use linear interpolation to compute the $n$ for which Bayesian power is 0.9.```{r}u <- m[Type =='LR', .(nmin=lookup(n, phi)), by=h]uNprime <- u[h ==1, nmin]```Check the Bayesian power to detect a trivial or worse treatment benefit when true HR is 1.25.```{r}HR <-1.25hs <-0.95R <- S[HR ==1.25, pp(mu, v), by=.(n, rep)]# One minus because we want Pr(HR > 0.95)R[, phi :=1* (1.0- p >0.975)]m <-movStats(phi ~ n, eps=100, lrm=TRUE, data=R, melt=TRUE,lrm_args=list(maxit=30))ggplot(m, aes(x=n, y=phi, linetype=Type)) +geom_line() +ylab('Pr(Posterior Pr(HR > 0.975))') +geom_hline(yintercept=0.9, alpha=0.3) +scale_x_continuous(breaks=seq(0, 1000, by=100)) +scale_y_continuous(breaks=seq(0, 1, by=0.1),minor_breaks=seq(0, 1, by=0.05))```Find $n$ where it crosses 0.9.```{r}m[Type =='LR', lookup(n, phi)]```#### Accounting for Uncertainty in MCIDDifferent reviewers and journal readers may have different ideas of a minimal clinically important difference to detect. Instead of setting the MCID to HR=0.75, suppose that all HR in the interval [0.65, 0.85] are equally likely to be the MCID at the time the study findings are made known. Let's re-simulate 1000 trials to compute Bayesian power at a sample size of `r Nprime` per group, varying the MCID over [0.65, 0.85].```{r}set.seed(7)R <-expand.grid(n = Nprime, HR=runif(1000, 0.65, 0.85))setDT(R) # convert to data.tableS <- R[, sim1(n, 3, HR, 0.15, 1.5), by=.(n, HR)]pp <-function(mu, v) pnorm((log(1.0) - mu) /sqrt(v))bp <- S[, mean(pp(mu, v) >0.975)]bp```The Bayesian power drops from 0.9 to `r bp` when the stated amount of uncertainty is taken into account for MCID.### Bayesian Power for 3-Level Ordinal Outcome {#sec-design-popower}Suppose that the above time-to-event outcome is primary, and that there is a secondary outcome of infection within 90 days of randomization. To explicitly account for deaths within the 90-day window we use a 3-level proportional odds ordinal logistic model with 0=alive and infection-free, 1=alive and infection, 2=dead (whether or not preceded by infection). For the actual study we may use a partial proportional odds model to allow for a different treatment effect for the union of infection and death as compared to the effect for death alone. A prior distribution can be specified for the ratio of the two odds ratios (for infection + death and for death) such that there is a 0.95 chance that the ratio of odds ratios is between $\frac{1}{3}$ and $3$. A Bayesian power simulation with that amount of borrowing between the treatment effect for the two outcomes is presented at the end of this section.Use the Weibull model described above to estimate 90-day mortality.```{r}rnd <-function(x) round(x, 3)S <-function(t) exp(-0.9357* t ^0.5224)m90 <-1.-S(90/365.25)m30 <-1.-S(30/365.25)```The estimated mortality by 90d is `r rnd(m90)`. The estimated probability of infection for 90-day survivors is 0.2, so the ordinal outcome is expected to have relative frequencies listed in the first line below.```{r}# Hmisc posamsize and popower function cell probabilities are averaged# over the two groups, so need to go up by the square root of the odds # ratio to estimate these averagesphigh <-c(1.-0.2- m90, 0.2, m90)or <-0.65pmid <-pomodm(p=phigh, odds.ratio=sqrt(or))plow <-pomodm(p=pmid, odds.ratio=sqrt(or))phigh <-rnd(phigh)pmid <-rnd(pmid)plow <-rnd(plow)n <-round(posamsize(pmid, odds.ratio=or, power=0.8)$n /2)Infpow <-round(popower(pmid, odds.ratio=or, n=2* Nprime)$power, 3)Infpow```| Treatment | Y=0 | Y=1 | Y=2 ||:-------------|-------------:|-------------:|-------------:|| Standard Tx | `r phigh[1]` | `r phigh[2]` | `r phigh[3]` || New Tx | `r plow[1]` | `r plow[2]` | `r plow[3]` || Average | `r pmid[1]` | `r pmid[2]` | `r pmid[3]` |The second line in the table are the outcome probabilities when an odds ratio of 0.65 is applied to the standard treatment probabilities. The last line is what is given to the `posampsize` and `popower` functions to compute the sample size need for an $\alpha=0.05$ two-sided Wilcoxon/proportional odds test for differences in the two groups on the 3-level outcome severity with power of 0.8. The needed sample size is `r n` per group were a frequentist test to be used. The frequentist power for the sample size (`r Nprime` per group) that will be used for the primary outcome is `r Infpow`.For Bayesian power we use a process similar to what was used for the proportional hazards model above, but not quite trusting the quadratic approximation to the log likelihood for the proportional odds model, due to discretenesss of the outcome variable. Before simulating a power curve for all sample sizes using the normal approximation to the data likelihood, we do a full Bayesian simulation of 1000 trials with 300 patients per group in addition to using the approximate normal prior/normal likelihood method.Here is a function to simulate one study with an ordinal outcome.```{r k3}require(rmsb)options(mc.cores=4, rmsb.backend='cmdstan')cmdstanr::set_cmdstan_path(cmdstan.loc)sim1 <- function(n, pcontrol, or, orcut=1, bayes=FALSE) { # n: number of pts in each group # pcontrol: vector of outcome probabilities for control group # or: odds ratio # orcut: compute Pr(OR < orcut) # Compute outcome probabilities for new treatment ptr <- pomodm(p=pcontrol, odds.ratio=or) k <- length(pcontrol) - 1 tx <- c(rep(0, n), rep(1, n)) y <- c(sample(0 : k, n, replace=TRUE, prob=pcontrol), sample(0 : k, n, replace=TRUE, prob=ptr) ) f <- lrm(y ~ tx) chi <- f$stats['Model L.R.'] theta.hat <- coef(f)['tx'] s2 <- vcov(f)['tx', 'tx'] tau <- 1.9989 omega <- 1.0 / s2 v <- 1.0 / (tau + omega) mu <- omega * theta.hat / (tau + omega) if(! all(c(length(mu), length(v), length(chi))) == 1) stop("E") postprob <- NULL if(bayes) { # Set prior on the treatment contrast pcon <- list(sd=0.7073, c1=list(tx=1), c2=list(tx=0), contrast=expression(c1 - c2) ) f <- blrm(y ~ tx, pcontrast=pcon, loo=FALSE, method='sampling', iter=1000) dr <- f$draws[, 'tx'] postprob <- mean(dr < log(orcut)) # Pr(beta<log(orcut)) = Pr(OR<orcut) } c(mu = as.vector(mu), v = as.vector(v), pval = as.vector(1.0 - pchisq(chi, 1)), postprob = postprob)}# Define a function that can be run on split simulations on# different coresg <- function() { nsim <- 1000 # Function to do simultions on one core run1 <- function(reps, showprogress, core) { R <- matrix(NA, nrow=reps, ncol=4, dimnames=list(NULL, c('mu', 'v', 'pval', 'postprob'))) for(i in 1 : reps) { showprogress(i, reps, core) R[i, ] <- sim1(300, phigh, 0.65, bayes=TRUE) } list(R=R) } # Run cores in parallel and combine results # 3 cores x 4 MCMC chains = 12 cores = machine maximum runParallel(run1, reps=nsim, seed=1, along=1, cores=3)}R <- runifChanged(g, file='k3.rds') # 15 min.```Compute frequentist power and exact (to within simulation error) and approximate Bayesian powper.```{r}u <-as.data.table(R)pp <-function(mu, v) pnorm((log(1.0) - mu) /sqrt(v))u[, approxbayes :=pp(mu, v)]cnt <-c(1:5, 10, 25, 50)ggplot(u, aes(x=approxbayes, y=postprob)) +stat_binhex(aes(fill=cut2(after_stat(count), cnt)), bins=75) +geom_abline(alpha=0.3) +guides(fill=guide_legend(title='Frequency')) +xlab('Approximate Posterior Probability OR < 1') +ylab('Exact Posterior Probability')h <-function(x) c(Power =round(mean(x), 3))u[, rbind('Frequentist'=h(pval <0.05),'Bayesian'=h(postprob >0.975),'Approximate Bayesian'=h(approxbayes >0.975))]```The posterior distribution is extremely well approximated by a normal distribution at a sample size that matters. Bayesian power is just slightly below frequentist power because frequentist test statistics do not place limits on the treatment effect, e.g., allow the odds ratio to be unrealistically far from 1.0.Use the approximation for proportional odds model Bayesian power going forward. Compute power for the primary endpoint's sample size.```{r k3n}R <- expand.grid(n = Nprime, rep=1:1000)setDT(R)g <- function() { set.seed(17) R[, as.list(sim1(n, phigh, or, bayes=FALSE)), by=.(n, rep)]}S <- runifChanged(g, R, or, sim1, file='k3n.rds')S[, pphi := 1 * (pp(mu, v) > 0.975), by=.(n, rep)]Infpower <- S[, mean(pphi)]Infpower```The Bayesian power to detect an OR of 0.65 at a sample size of `r Nprime` per group is `r Infpower`.#### Bayesian Power for Different Treatment Effect for Infection and DeathThe proportional odds (PO) model can provide a correct assessment of which treatment is better overall even when [PO is not satisfied](https://fharrell.com/post/powilcoxon). But with non-PO the power will not be optimal, and estimates of outcome probabilities for individual outcomes can be off. To estimate Bayesian power when PO is not assumed, allow the treatment effect (OR) for I|D (infection unioned ("or"d) with death) to differ from the OR for D. In the $Y=0,1,2$ notation we have one OR for $Y\geq 1$ and a separate OR for $Y=2$. This is the partial PO model of Peterson & Harrell (1990). The amount of borrowing for the I|D treatment effect and the D effect is specified by a prior for the ratio of the I|D and D ORs. We assume the log of the ratio of the ORs is normal with mean zero and variance such that the probability that the ratio of ratios is between $\frac{1}{3}$ and $3$ is 0.95. The standard deviation accomplishing this is $\frac{\log(3)}{1.96} = 0.5605$. Also simulate the power of the treatment comparison on I|D when there is no borrowing of treatment effect information across I and D, i.e., when the prior for the non-PO effect (here the special effect of treatment on D) is non-informative. This partial PO model assumes nothing about how the treatment effect on I|D compares to the effect on D.The sample size is 302 per group to yield Bayesian power of 0.8 against OR=0.65 in PO holds. For this sample size, simulations showed the loss of power by not assuming PO was minimal when the data satisfied PO. So the simulations are run with $n=75$ per group to better understand power implications for varying amounts of information borrowing.The R `rmsb` package's `blrm` function has the user specify priors for departures from PO through contrasts and through a "departure from PO" function (the `cppo` argument). The contrast for the differential treatment effect over outcome components in the information-borrowing case is the same one that we apply to the main treatment effect except for using a standard deviation involving $\log(3)$ instead of $\log(4)$.When the statistical analysis plan for the study is detailed, more simulations will be run that use a non-PO data generating mechanism, but for now data will be simulated the same as earlier, assuming PO for treatment. ```{r k3npo}sim1npo <- function(n, pcontrol, or, orcut=1, peffect=FALSE) { # n: number of pts in each group # pcontrol: vector of outcome probabilities for control group # or: odds ratio to detect for I|D # orcut: Compute Pr(OR < orcut) # Set peffect=TRUE to debug # Compute outcome probabilities for new treatment ptr <- pomodm(p=pcontrol, odds.ratio=or) k <- length(pcontrol) - 1 tx <- c(rep(0, n), rep(1, n)) y <- c(sample(0 : k, n, replace=TRUE, prob=pcontrol), sample(0 : k, n, replace=TRUE, prob=ptr) ) # Set prior on the treatment contrast pcon <- list(sd=0.7073, c1=list(tx=1), c2=list(tx=0), contrast=expression(c1 - c2) ) # PO model first f <- blrm(y ~ tx, pcontrast=pcon, loo=FALSE, method='sampling', iter=1000) g <- blrm(y ~ tx, y ~ tx, cppo=function(y) y == 2, pcontrast=pcon, loo=FALSE, method='sampling', iter=1000) # Set prior on the "treatment by Y interaction" contrast npcon <- pcon npcon$sd <- 0.5605 h <- blrm(y ~ tx, y ~ tx, cppo=function(y) y == 2, pcontrast=pcon, npcontrast=npcon, loo=FALSE, method='sampling', iter=1000) if(peffect) { print(f) print(g) print(h) print(rbind(po=c(coef(f), NA), flat=coef(g), skeptical=coef(h))) return() } df <- f$draws[, 'tx'] # log OR for I|D dg <- g$draws[, 'tx'] # = log OR for D for PO dh <- h$draws[, 'tx'] # Pr(beta < 0) = Pr(OR < 1) lor <- log(orcut) c(po = mean(df < lor), flat = mean(dg < lor), skeptical = mean(dh < lor))}g <- function() { nsim <- 1000 # Function to do simultions on one core run1 <- function(reps, showprogress, core) { R <- matrix(NA, nrow=reps, ncol=3, dimnames=list(NULL, c('po', 'flat', 'skeptical'))) for(i in 1 : reps) { showprogress(i, reps, core) R[i,] <- sim1npo(N, phigh, 0.65) } list(R=R) } # Run cores in parallel and combine results # 3 cores x 4 MCMC chains = 12 cores = machine maximum runParallel(run1, reps=nsim, seed=11, cores=3, along=1)}N <- 75R <- runifChanged(g, sim1, N, file='k3npo.rds')bp <- apply(R, 2, function(x) mean(x >= 0.975))bp```For $n=75$ per group, the Bayesian power is `r bp['po']` when completely borrowing information from D to help inform the treatment effect estimate for D|I. Allowing for arbitrarily different effects (non-PO) reduces the power to `r bp['flat']`, and assuming some similarity of effects yields a power of `r bp['skeptical']`.### Bayesian Power for 6-Level Ordinal Non-Inferiority OutcomeSuppose that the 6-level ECOG performance status assessed at 30 days is another secondary endpoint, and is considered to be a non-inferiority endpoint. Assessment of non-inferiority is more direct with a Bayesian approach and doesn't need to be as conservative as a frequentist analysis, as we are assessing evidence for a broader range of treatment effects that includes slight harm in addition to the wide range of benefits. The outcome will be analyzed using a Bayesian proportional odds model with the same prior on the log odds ratio as has been used above. The 6 cell probabilities assumed for the control group appear in the top line in the table below.The estimated mortality by 30d is `r rnd(m30)`. ```{r}# Start with relative probabilities of Y=0-4 and make them# sum to one minus 30d mortalityphigh <-c(42, 41, 10, 1, 1) /100phigh <-c(phigh /sum(phigh) * (1.- m30), m30)or <-0.65pmid <-pomodm(p=phigh, odds.ratio=sqrt(or))plow <-pomodm(p=pmid, odds.ratio=sqrt(or))n <-round(posamsize(pmid, odds.ratio=or, power=0.8)$n /2)Perfpowf <-popower(pmid, odds.ratio=or, n = Nprime *2)$powerg <-function(x) {names(x) <-paste0('Y=', (1:length(x)) -1)round(x, 3)}rbind('Standard treatment'=g(phigh),'New treatment'=g(plow),'Average'=g(pmid) )```The second line in the table are the outcome probabilities when an odds ratio of 0.65 is applied to the standard treatment probabilities. The last line is what is given to the `posampsize` function to compute the sample size need for an $\alpha=0.05$ two-sided Wilcoxon / proportional odds test for differences in the two groups on the 6-level outcome severity with power of 0.8. This is also used for `popower`. The needed sample size is `r n` per group were a frequentist test to be used. This is for a superiority design. The frequentist power for `r Nprime` per group is `r Perfpowf`.For performance status a non-inferiority design is used, so Bayesian power will differ more from frequentist superiority power, as the Bayesian posterior probability of interest is $\Pr(\text{OR} < 1.2)$. Simulate Bayesian power using the normal approximation.```{r k6}g <- function() { set.seed(14) or <- 0.65 R <- expand.grid(n = Nprime, rep=1:1000) setDT(R) R[, as.list(sim1(n, phigh, or, orcut=1.2, bayes=FALSE)), by=.(n, rep)]}S <- runifChanged(g, sim1, file='k6.rds') # 16spp <- function(mu, v) pnorm((log(1.2) - mu) / sqrt(v))S[, pphi := 1 * (pp(mu, v) > 0.975), by=.(n, rep)]Perfpowb <- S[, mean(pphi)]```Bayesian power to detect an OR of 0.65 when seeking evidence for OR < 1.2 with `r Nprime` patients per group is `r Perfpowb`.## Sample Size Needed For Prior Insensitivity {#sec-design-pn}One way to choose a target sample size is to solve for $n$ that controls the maximum absolute difference in posterior probabilities of interest, over the spectrum of possible-to-observe data. Let * $n$ = sample size in each of two treatment groups* $\Delta$ = unknown treatment effect expressed as a difference in means $\mu_{1} - \mu_{2}$* the data have a normal distribution with variance 1.0* $\hat{\Delta}$ = observed difference in sample means $\overline{Y}_{1} - \overline{Y}_{2}$ * $V(\hat{\Delta}) = \frac{2}{n}$* Frequentist test statistic: $z = \frac{\hat{\Delta}}{\sqrt{\frac{2}{n}}}$* $\delta$ assumed to be the difference not to miss, for frequentist power* One-sided frequentist power for two-sided $\alpha$ with $z$ critical value $z_{\alpha}$: $\Pr(z > z_{\alpha}) = 1 - \Phi(z_{\alpha} - \frac{\delta}{\sqrt{\frac{2}{n}}})$Power curves are shown below for $\alpha=0.05$.```{r}#| label: design-fig-zpower#| fig.width: 5#| fig.height: 3.75d <-expand.grid(n=c(1:19, seq(20, 200, by=5)),delta =seq(0, 2, by=0.25))ggplot(d, aes(x=n, y=1-pnorm(qnorm(0.975) - delta /sqrt(2/ n)), color=factor(delta))) +geom_line() +guides(color=guide_legend(title=expression(delta))) +scale_x_continuous(minor_breaks=seq(0, 200, by=10)) +scale_y_continuous(minor_breaks=seq(0, 1, by=0.05)) +xlab('n') +ylab(expression(paste('Power at ', alpha==0.05)))```* Assume skeptical prior for $\Delta$ that is normal with mean 0 and variance $\frac{1}{\tau}$* Posterior distribution for $\Delta | \hat{\Delta}$ is normal + variance = $\frac{1}{\tau + \frac{n}{2}}$ + mean = $\frac{n\hat{\Delta}}{n + 2\tau}$```{r}pp <-function(tau, n, delta, Deltahat)round(1-pnorm((delta - n * Deltahat / (n +2* tau)) *sqrt(tau + n /2)), 4)# Compute prior SD such that P(Delta > delta) = probpsd <-function(delta, prob) round(delta /qnorm(1- prob), 4)```<a id="pmislead"></a>* Example: Probability of misleading evidence for efficacy + agreed-upon prior was such that $\Pr(\Delta > 0.5) = 0.05 \rightarrow$ SD of the prior is `r s1 <- psd(0.5, 0.05); s1` + someone disagrees, and has their own posterior probability of efficacy based on their prior, which was chosen so that $\Pr(\Delta > 0.5) = 0.005 \rightarrow$ SD of their prior is `r s2 <- psd(0.5, 0.005); s2` + suppose that $\hat{\Delta}$ at $n=100$ is 0.3 + $\rightarrow$ posterior probability of $\Delta > 0$ using pre-specified prior is `r pp1 <- pp(1/s1, 100, 0, 0.3); pp1` + $\rightarrow$ posterior probability of $\Delta > 0$ using the critic's prior is `r pp2 <- pp(1/s2, 100, 0, 0.3); pp2` + there is substantial evidence for efficacy in the right direction + probability of being wrong in acting as if the treatment works is `r 1 - pp1` to those who designed the study, and is `r 1 - pp2` to the critic + the two will disagree more about evidence for a **large** effect, e.g. $\Pr(\Delta > 0.2)$ + $\Pr(\Delta > 0.2)$ using pre-specified prior: `r pp1 <- pp(1/s1, 100, 0.2, 0.3); pp1` + $\Pr(\Delta > 0.2)$ for the critic: `r pp2 <- pp(1/s2, 100, 0.2, 0.3); pp2` + Probabilities of being wrong in acting as if the treatment effect is large: `r 1 - pp1`, `r 1 - pp2`* Sample size needed to overcome effect of prior depends on the narrowness of the prior ($\tau; \sigma$) + for posterior mean and variance to be within a ratio $r$ of those from a flat prior $n$ must be $> \frac{2\tau r}{1 - r}$ + within 5% for $n > 38\tau$ + within 2% for $n > 98\tau$ + within 1% for $n > 198\tau$ + typical value of $\sigma = \frac{1}{\tau} = 0.5$ so $n > 196$ to globally be only 2% sensitive to the prior* Posterior $\Pr(\Delta > \delta | \hat{\Delta}) = 1 - \Phi((\delta - \frac{n\hat{\Delta}}{n + 2\tau})\sqrt{\tau + \frac{n}{2}})$* Choose different values of the prior SD to examine sensitivity of cumulative posterior distribution to prior* Below see that the posterior probability of _any_ treatment benefit (when $\delta=0$) is not sensitive to the prior* Probability of benefit $> \delta$ for large $\delta$ is sensitive to the prior + priors having large probability of large $\Delta$ lead to higher posterior probabilities of large $\Delta > \delta$ when $\delta$ is large* For each $n, \delta$, and SDs compute the absolute difference in posterior probabilities that $\Delta > \delta$* First look at the result when assessing evidence for _any_ efficacy ($\delta=0$) when the most extreme priors ($\sigma=\frac{1}{8}$ vs. $\sigma=8$) are compared* Don't consider conditions when both posterior probabilities are in [0.1, 0.9] where disagreements are probably inconsequential```{r}require(data.table)sigmas <-c(1/8, 8)w <-expand.grid(n =c(1:30, seq(35, 200, by=5)),Deltahat =seq(-1, 5, by=0.025))setDT(w)z <-lapply(1:nrow(w),function(i) w[i, .(n, Deltahat,p1 =pp(8, n, 0, Deltahat),p2 =pp(1/8, n, 0, Deltahat) )] )u <-rbindlist(z)# Ignore conditions where both posterior probabilities are in [0.1, 0.9]# Find maximum abs difference over all Deltahatm <- u[, .(mdif =max(abs(p1 - p2) * (p1 <0.1| p2 <0.1| p1 >0.9| p2 >0.9))), by=n]ggplot(m, aes(x=n, y=mdif)) +geom_line() +scale_x_continuous(minor_breaks=seq(0, 200, by=10)) +scale_y_continuous(minor_breaks=seq(0, 0.2, by=0.01)) +ylab(expression(paste('Maximum |Difference| in ', Pr(Delta > delta ~"|"~hat(Delta))))) ```* By $n=170$ the largest absolute difference in posterior probabilities of any efficacy is 0.01* The difference is 0.02 by $n=80$* Sample sizes needed for frequentist power of 0.9 are $n=38$ for $\Delta=0.5$ and $n=85$ for $\Delta=0.25$* Now consider the full picture where evidence for efficacy $> \delta$ where $\delta \neq 0$ is included* Vary $n, \delta, \hat{\Delta}$ and the two prior SDs* Don't show points where the difference < 0.0025 or when both posterior probabilities are in [0.1, 0.9]* In the graph one of the prior SDs is going across, and the other is vertical; diagonals where the two SDs are equal are not needed```{r results='asis'}#| fig.width: 8.5#| fig.height: 7.75#| column: screen-right# For a given n and delta two values of SDs compute the absolute difference# of the two posterior probabilities that Delta > delta when one of the posterior# probabilities was < 0.1 or > 0.9 and when the difference exceeds 0.0025adif <- function(n, delta, s1, s2) { tau1 <- 1 / s1 tau2 <- 1 / s2 Deltahat <- seq(-1, 5, by=0.05) p1 <- pp(tau1, n, delta, Deltahat) p2 <- pp(tau2, n, delta, Deltahat) i <- abs(p1 - p2) >= 0.0025 & (pmin(p1, p2) < 0.1 | pmax(p1, p2) > 0.9) data.table(n, delta, s1, s2, Deltahat, ad=abs(p1 - p2))[i]}taus <- c(1/8, 1/4, 1/2, 1, 2, 4, 8)sigmas <- 1 / tausdels <- seq(-1, 1, by=0.5)w <- expand.grid(n = c(10, 15, seq(20, 400, by=10)), delta = dels, s1 = sigmas, s2 = sigmas)setDT(w)w <- w[s2 > s1]z <- lapply(1 : nrow(w), function(i) w[i, adif(n, delta, s1, s2)])u <- rbindlist(z)u[, adc := cut2(ad, c(0, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.1, 0.15, 0.2, 0.4))]g <- vector('list', length(dels))names(g) <- paste0('δ=', dels)for(del in dels) g[[paste0('δ=', del)]] <- ggplot(u[delta==del], aes(x=n, y=Deltahat, color=adc)) + geom_point(size=0.3) + facet_grid(s1 ~ s2) + guides(color=guide_legend(title='|Difference|')) + theme(legend.position='bottom') + ylab(expression(hat(Delta))) + labs(caption=bquote(delta == .(del)))maketabs(g, initblank=TRUE)```* "Mouth" from the left of each panel corresponds to regions where both posterior probabilities are in [0.1, 0.9]* Most extremely different priors examined had SD=0.25 (skeptical of large $\Delta$) vs. SD=8 (large $\Delta$ are likely)* For $\delta=0$ differences vanish for $n > 200$* For $\delta=-1$ differences are greatest when observed differences in means $\hat{\Delta} = -1$* For $\delta=1$ differences are greatest when $\hat{\Delta} = 1$* Required $n=400$ for influence of prior to almost vanish everywhere when the most different priors are paired* More realistic to compare somewhat skeptical SD=1 with skeptical SD=0.25* Larger discrepancy when $\delta=-1$ and $\hat{\Delta}=-1$* $n > 300$ makes for small discrepancyAnother display of the same conditions except for fewer values of $\hat{\Delta}$.```{r}#| fig.width: 8.5#| fig.height: 7.75#| column: screen-rightw <-expand.grid(n =c(10, 15, seq(20, 400, by=10)),delta =seq(-1, 1, by=0.5),s = sigmas,Deltahat =c(-1, -0.5, 0, 0.25, 0.5, 0.75, 1, 1.5))setDT(w)z <-lapply(1:nrow(w), function(i) w[i, .(n, delta, s, Deltahat, pp=pp(1/s, n, delta, Deltahat))])u <-rbindlist(z)ggplot(u, aes(x=n, y=pp, color=factor(s))) +geom_line() +facet_grid(paste0('delta==', delta) ~paste('hat(Delta)==', Deltahat), label='label_parsed') +guides(color=guide_legend(title=expression(paste('Prior ', sigma)))) +ylab(expression(Pr(Delta > delta ~"|"~hat(Delta)))) +theme(legend.position='bottom')```See @rig23con for related ideas.## Sequential Monitoring and Futility Analysis::: {.quoteit}I want to make it quite clear that I have no objection with flexible designs where they help us avoid pursuing dead-ends. The analogous technique from the sequential design world is stopping for futility. In my opinion of all the techniques of sequential analysis, this is the most useful. --- @sen13bei:::* Compute PP as often as desired, even continuously* Formal Bayesian approach can also consider having adequate safety data + Stop when PP > 0.95 for any efficacy **and** PP > 0.95 for safety event hazard ratio ≤ 1.2* @spi93app argue for use of 'range of equivalence' in monitoring* Bayesian design is very simple if no cap on n* But often reach a point where clinically important efficacy unlikely to be demonstrated within a reasonable n* Futility is more easily defined with Bayes. 3 possible choices: + Stop when P(efficacy < ε) is > 0.9 + Use posterior _predictive distribution_ at planned max sample size (@ber06bay) + Pre-specify max n and continue until that n if trial not stopped early for efficacy or harm (more expensive)* Insightful approximation of @spi93app + use Z-statistic at an interim look + assume fraction f of subjects randomized to date + based only on observables```{r zfutil}#| fig.cap: 'Predictive probability of the final 0.95 credible interval excluding zero, and the treatment effect being in the right direction, given the fraction f of study completed and the current test statistic Z when the prior is flat. f values are written beside curves.'spar(bty='l')pf <- function(z, f) pnorm(z/sqrt(1 - f) - 1.96 * sqrt(f) / sqrt(1 - f))zs <- seq(-1, 3, length=200)fs <- c(.1, .25, .4, .75, .9)d <- expand.grid(Z=zs, f=fs)f <- d$fd$p <- with(d, pf(Z, f))d$f <- NULLp <- split(d, f)labcurve(p, pl=TRUE, ylab='Predictive Probability')```* Example: to have reasonable hope of demonstrating efficacy if an interim Z value is 1.0, one must be less than 1/10th of the way through the trial* For example simulations of futility see [this](https://hbiostat.org/R/Hmisc/gbayesSeqSim.html)::: {.quoteit}The published report of any study attempts to answer the crucial question: _What does this trial add to our knowledge?_ The strength of the Bayesian approach is that it allows one to express the answer _formally_. It therefore provides a rational framework for dealing with many crucial issues in trial design, monitoring and reporting. In particular, by making explicitly different prior opinions about efficacy, and differing demands on new therapies, it may shed light on the varying attitudes to balancing the clinical duty to provide the best treatment for individual patients, against the desire to provide sufficient evidence to convince the general body of clinical practice. --- @spi93app:::## Example Operating Characteristics {#sec-design-eop}::: {.quoteit}```{r echo=FALSE}R <- Rdif(min(length(R$propconc), length(R$correct), length(R$stopn), length(R$stopany)) == 0) propeff <- cor <- eff.cor <- eff.any <- propineff <- ineff.cor <- ineff.times <- ineff.q1 <- ineff.q2 <- stopmean <- fracnone <- ' ?? ' else { propeff <- R$propconc['non-trivial efficacy', '0.2'] cor <- R$correct[kappa == 0.2] eff.cor <- cor[conclusion == 'non-trivial efficacy', 6] eff.any <- cor[conclusion == 'non-trivial efficacy', 7] propineff <- R$propconc['inefficacy', '0.2'] ineff.cor <- cor[conclusion == 'inefficacy', 4] ineff.times <- R$stopn[conclusion == 'inefficacy' & kappa == 0.2] ineff.q1 <- ineff.times[, Q1] + 1 ineff.q2 <- ineff.times[, Median] + 1 stopmean <- R$stopany[kappa == 0.001, Mean] fracnone <- R$propconc['none', '0.001'] }```Even with a disadvantageous mismatch between the prior used in the analysis and the prior used to generate the universe of efficacies, the design was able to stop early for evidence of non-trivial efficacy in `r propeff` of 10,000 trials over a skeptical distribution of true efficacies, and was right `r eff.cor` of the time when doing so. The design was right `r eff.any` of the time in terms of there being truly some efficacy when stopping for more than trivial efficacy. Stopping early for evidence of less than trivial efficacy was possible in `r propineff` of the trials and was the correct decision `r ineff.cor` of the time. One quarter of such trials stopped with fewer than `r ineff.q1` participants studied and half had a sample size less than `r ineff.q2`. Overall the average sample size at the time of stopping was `r stopmean` for the most skeptical prior, which is just over ½ what a traditional frequentist design would require (`r R$freqN`) just to have power to detect the MCID, still not providing evidence for non-trivial efficacy, inefficacy, or similarity. With the Bayesian sequential design only `r fracnone` of the trials ended without a conclusion for the most skeptical prior.:::For the Bayesian sequential design at the start of this chapter, simulate the operating characteristics for the following setting.* Two-arm parallel-group RCT with equal sample sizes per group* Response variable is normal with standard deviation 1.0* Estimand of interest is $\Delta$ = the unknown difference in means* MCID is $\delta=0.3$* Minimum clinically detectable difference is $\gamma=0.1$* Similarity interval is $\Delta \in [-0.15, 0.15]$ [With these similarity limits a large number of studies were stopped early for similarity. In one test with a narrower similarity interval, 0.035 of simulated studies were stopped early for similarity.]{.aside}* Maximum affordable sample size is 750 per treatment group* Required minimum precision in estimating $\Delta$: $\pm 0.4$ (half-width of 0.95 credible interval) + $\rightarrow$ minimum sample size $N_{p}$ before stopping for efficacy + Allow stopping for inefficacy at any time, even if the treatment effect cannot be estimated after stopping, i.e., one only knows there is a high probability that the effect $< \gamma$.* The first analysis takes place after the first participant has been enrolled and followed, then after each successive new participant's data are complete. The first look for efficacy is delayed until the $N_{p}^\text{th}$ participant has been followed so that the magnitude of the treatment effect has some hope of being estimated. [When using an ultra-high information continuous response variable as in this example, with the between-subject standard deviation known to be 1.0, the data having a normal distribution, and not delaying the first look, it is possible to make a decision with any one participant if she has a very impressive response. This could never happen with a binary endpoint or with unknown SD.]{.aside}* Prior for $\Delta$ is normal with mean 0 and SD $\sigma$ such that $\Pr(\Delta > 0.75) = \kappa$ where $\kappa$ is varied over ${0.2, 0.1, 0.05, 0.025}$ + This corresponds to $\sigma=$ `r kappas=c(0.2, 0.1, 0.05, 0.025); psd(0.75, kappas)` + These are skeptical priors, chosen because the treatment is assumed to be an incremental rather than a curative therapy`r R$freqN <- round(2 * ((1.96 + 1.282) / 0.3) ^ 2)`The assumed data model leads to a frequentist point estimate of $\Delta$ = $\hat{\Delta}$ = difference in two means, with variance $\frac{2}{n}$, where $n$ is the per-group sample size. The one-sided frequentist power for a two-tailed test with $\alpha=0.05$ is $1 - \Phi(1.96 - \frac{0.3}{\sqrt{\frac{2}{n}}})$, and setting this to 0.9 requires $n=2[\frac{1.96 + 1.282}{0.3}]^2 = `r R$freqN`$. 1.282 is $\Phi^{-1}(0.9)$.`r cNp <- function(moe) round(2 * (1.645 / moe) ^ 2)``r R$Np <- cNp(0.3)`To arrive at $N_{p}$, the minimum sample size needed to achieve non-embarrassing precision in estimating $\Delta$, keep in mind that skeptical priors for $\Delta$ push the posterior distribution for $\Delta$ towards zero, resulting in narrower posterior distributions. To be conservative in estimating $N_{p}$ use a flat prior. The 0.90 credible interval is $\hat{\Delta} \pm 1.645\sqrt{\frac{2}{n}}$. Setting the half-length of the interval to $\delta=0.3$ results in $n = N_{p} = 2(\frac{1.645}{0.3})^2 = `r R$Np`$ per group. [Were a margin of error of 0.2 been required, $N_{p}=`r cNp(0.2)`$, i.e., the first look for possible stoppage for efficacy would be at the $`r cNp(0.2)`^\text{th}$ participant. Setting the required precision to 0.085 would result in no early stopping for efficacy, which is a choice that some trial designers may make, but it requires massive justification for the choice of the maximum sample size.]{.aside}At a given interim look with a cumulative difference in means of $\hat{\Delta}$ the posterior distribution of $\Delta | \hat{\Delta}$ is normal with variance $\frac{1}{\tau + \frac{n}{2}}$ and mean $\frac{n\hat{\Delta}}{n + 2\tau}$ where $\tau$ is the prior precision, $\frac{1}{\sigma^2}$. Here is a function to compute the posterior probability that $\Delta > \delta$ given the current difference in means $\hat{\Delta}$.```{r}pp <-function(delta, Deltahat, n, kappa) { sigma <-psd(0.75, kappa) tau <-1./ sigma ^2 v <-1./ (tau + n /2.) m <- n * Deltahat / (n +2.* tau)1.-pnorm(delta, m, sqrt(v))}```A simple Bayesian power calculation for a single analysis after $n$ participants are enrolled in each treatment group, and considering only evidence for _any_ efficacy, sets the probability at 0.9 that the posterior probability of $\Delta > 0$ exceeds 0.95 when $\Delta=\delta$. The posterior probability PP that $\Delta > 0$ is $\Phi(\frac{n\hat{\Delta}/\sqrt{2}}{\sqrt{n + 2\tau}})$. Making use of $\hat{\Delta} \sim n(\delta, \frac{2}{n})$, the probability that PP > 0.95 is $1 - \Phi(\frac{za - \delta}{\sqrt{\frac{2}{n}}})$ where $z = \Phi^{-1}(0.95)$ and $a=\frac{\sqrt{2}\sqrt{n + 2\tau}}{n}$. ```{r}# Special case when gamma=0, for checkingbpow <-function(n, delta=0.3, kappa=0.05) { tau <-1./psd(0.75, kappa) ^2 z <-qnorm(0.95) a <-sqrt(2) *sqrt(n +2* tau) / n1.-pnorm((z * a - delta) /sqrt(2/ n))}bpow <-function(n, delta=0.3, gamma=0, kappa=0.05, ppcut=0.95) { tau <-1./psd(0.75, kappa) ^2 z <-qnorm(ppcut) a <- n +2.* tau1.-pnorm(sqrt(n /2) * ((a / n) * (gamma + z *sqrt(2/ a) ) - delta))}findx <-function(x, y, ycrit, dec=0) round(approx(y, x, xout=ycrit)$y, dec)R$bayesN <-round(findx(10:500, bpow(10:500), 0.9))bpow(R$bayesN)```Setting the Bayesian power to 0.9 results in $`r R$bayesN`$ when $\kappa=0.05$. Contrast this with the frequentist sample size of `r R$freqN`. To understand the stringency of adding the requirement that efficacy is non-trivial to the requirement of having a very high probability that _any_ efficacy exists, consider a sample size (45) and current mean difference (0.3) where the probability of any efficacy is 0.9. Compute the probability of non-trivial efficacy with that mean difference. The prior probability of a very large (0.75) or larger treatment effect is taken as $\kappa=0.05$.```{r}pp( 0, 0.3, 45, 0.05)pp(0.1, 0.3, 45, 0.05)pp( 0, 0.36, 45, 0.05)pp(0.1, 0.36, 45, 0.05)```At the point at which there is over 0.9 chance of any efficacy in the right direction, the probability of non-trivial efficacy is 0.78. To achieve a probability of 0.85 would require an observed mean difference of 0.36 which raises the probability of _any_ efficacy to 0.94.What if instead we wanted 0.9 Bayesian power to achieve a PP > 0.85 that $\Delta > \gamma$, the threshold for a detectable treatment effect? Let's compute over a range of $n$ the two relevant probabilities: of getting $\Pr(\Delta > 0) > 0.95$ and of getting $\Pr(\Delta > 0.10) >0.85$, and also compute Bayesian power for $\Pr(\Delta > 0.15) > 0.8$.```{r powerbycrit}gammas <- c(0, .1, .15)ppcuts <- c(0.95, 0.85, 0.8)w <- expand.grid(crit=1:3, n=seq(10, 750, by=5))setDT(w)w[, Criterion := paste0('Pr(efficacy > ', gammas[crit], ') > ', ppcuts[crit])]w[, p := bpow(n=n, gamma=gammas[crit], ppcut=ppcuts[crit])]ggplot(w, aes(x=n, y=p, col=Criterion)) + geom_line() + scale_y_continuous(breaks=seq(0, 1, by=0.1)) + ylab('Bayesian Power') + theme(legend.position='bottom')# Solve for n to get Bayesian power of 0.9p <- w[, .(n=findx(n, p, 0.9)), by=.(crit, Criterion)]navg <- if(length(R$stopn)) R$stopn[conclusion == 'non-trivial efficacy' & kappa==0.001, Mean]```For a fixed sample-size study the necessary $N$ to have a 0.9 chance that the posterior probability of more than trivial efficacy exceeds 0.85 is `r p[crit == 2, n]`.Later in this section the expected sample size to establishment of non-trivial efficacy will be estimated when continuous sequential analysis is allowed. This number is `r navg` for the most skeptical prior used. The more general expected sample size is that required on the average to reach any conclusion. This number is `r if(length(R$stopany)) R$stopany[kappa==min(kappa), Mean]`.### Decision ThresholdsBefore simulating the trials, let's compute the decision thresholds, i.e., the least impressive values of the observed difference in means $\hat{\Delta}$ that would trigger a given conclusion, as a function of the current sample size.Let $\hat{\Delta}$ be a regular sequence of values for each sample size. For the current purpose don't delay any efficacy looks, and treat each of the three reasons for stopping independently. For inefficacy, the $\hat{\Delta}$ threshold is the highest $\hat{\Delta}$ leading to stopping for inefficacy. For non-trivial efficacy, this is the lowest $\hat{\Delta}$ leading to stopping for non-trivial effectiveness. For similarity the threshold is an interval with the lower value being the minimum $\hat{\Delta}$ meeting the similarity criterion and the upper limit is the maximum $\hat{\Delta}$ meeting the criterion.Do the calculations for one $\kappa$.```{r decthresh}difs <- seq(-1.5, 1.5, by=0.001)ns <- 1 : 750d <- expand.grid(dif=difs, n=ns)setDT(d)thresh <- function(Deltahat, n, kappa=0.05) { # For an ascendeing Deltahat vector and single value of n # Compute posterior probability of any efficacy p1 <- pp(0., Deltahat, n, kappa) # Compute probability of non-trivial efficacy p2 <- pp(0.1, Deltahat, n, kappa) # Compute probability of similarity p3 <- pp(-0.15, Deltahat, n, kappa) - pp( 0.15, Deltahat, n, kappa) ineff <- max(Deltahat[p2 < 0.025]) eff <- min(Deltahat[p1 > 0.95 & p2 > 0.85]) simlow <- min(Deltahat[p3 > 0.95]) simhi <- max(Deltahat[p3 > 0.95]) x <- c(ineff, eff, simlow, simhi) x[is.infinite(x)] <- NA list(conclusion=c('inefficacy', 'non-trivial efficacy', 'similarity-low', 'similarity-high'), Deltahat=x)}w <- d[, thresh(dif, n, kappa=0.05), by=n]R$threshold <- ww[n %in% c(50, 100, 200, 750)]ggplot(w, aes(x=n, y=Deltahat, col=conclusion)) + geom_line() + scale_x_continuous(minor_breaks=seq(0, 675, by=25)) + scale_y_continuous(breaks=seq(-1.5, 1.5, by=0.5), minor_breaks=seq(-1.5, 1.25, by=0.1)) + ylab(expression(hat(Delta))) + labs(title='Decision thresholds in terms of observed differences in means')sim.min <- w[conclusion == 'similarity-low' & ! is.na(Deltahat), min(n)]sim.max <- w[conclusion == 'similarity-high' & n == 750, Deltahat]ineff100 <- w[conclusion == 'inefficacy' & n == 100, Deltahat]ineff200 <- w[conclusion == 'inefficacy' & n == 200, Deltahat]nte50 <- w[conclusion == 'non-trivial efficacy' & n == 50, Deltahat]nte200 <- w[conclusion == 'non-trivial efficacy' & n == 200, Deltahat]nte.min <- w[conclusion == 'non-trivial efficacy' & n == 750, Deltahat]```A few observations are in order.* Similarity can't be concluded until $n > `r sim.min`$* Even then the observed difference needs to be almost exactly zero to make that conclusion* At the maximum sample size $n=750$ similarity is concluded if the absolute value of the observed $\hat{\Delta}$ is `r sim.max` or less (recall that similarity was defined as $\Delta \in [-0.15, 0.15]$)* An inefficacy conclusion is possible if $\hat{\Delta} < `r ineff100`$ when $n=100$ and if $\hat{\Delta} < `r ineff200`$ when $n=200$* A non-trivial efficacy conclusion is possible if $\hat{\Delta} > `r nte50`$ when $n=50$ or when $\hat{\Delta} > `r nte200`$ when $n=200$* When $\hat{\Delta} = `r nte.min`$ then planned maximum sample size of $n=750$ is required to be able to conclude non-trivial efficacy. Recall that the MCID is 0.3 and the threshold for trival effect is $\Delta=0.1$.::: {.callout-note collapse="true"}## Decision Thresholds on $z$ ScaleSince the observed treatment effect $\hat{\Delta}$ has variance $\frac{2}{n}$ with $n$ participants per group, the $\hat{\Delta}$ thresholds are easily converted into the $z$ scale $\sqrt{\frac{n}{2}}\hat{\Delta}$.```{r}ggplot(w, aes(x=n, y=Deltahat *sqrt(n /2), col=conclusion)) +geom_line() +scale_x_continuous(minor_breaks=seq(0, 675, by=25)) +ylab(expression(z)) +labs(title='Decision thresholds on the z-scale')```The patterns are different from what is seen with a frequentist approach, as here we see that when $n > 50$ the critical value for $z$ increases. That is because we are demanding at least an 0.85 chance that the efficacy is non-trivial ($\Delta > \gamma$). The drop in the upper curve from the lowest $n$ for which a conclusion of non-trivial efficacy is possible ($n=7$) to $n=50$ is due to the influence of the skeptical prior for $\Delta$ when $n$ is small. When a flat analysis prior is used for $\Delta$, the upturn disappears.:::### SimulationNow we simulate 10,000 sequential trials for each $\kappa$ to evaluate the Bayesian operating characteristics. $\kappa$ and the true unknown treatment effect $\Delta$ are varied. Studies are allowed to proceed to an upper limit sample size of 750 in each group. Each study is stopped when any stopping criterion is satisfied. The first function simulates one complete trial, computing the cumulative summary statistic (difference in means) after 1, 2, ..., 750 participants in each treatment group.```{r}sim <-function(Delta, kappa, N=750) {# Simulate raw data for whole planned trial, # for each treatment y1 <-rnorm(N, 0., 1.) y2 <-rnorm(N, Delta, 1.) j <-1: N ybar1 <-cumsum(y1) / j ybar2 <-cumsum(y2) / j Deltahat <- ybar2 - ybar1 Deltahat # cumulative mean difference}```The next function executes the stopping rules for one trial, looking sequentially.```{r}sim1 <-function(Delta, kappa, ignore=59) { Deltahat <-sim(Delta, kappa) n <-1:length(Deltahat)# Compute posterior probability of any efficacy p1 <-pp(0., Deltahat, n, kappa)# Compute probability of non-trivial efficacy p2 <-pp(0.10, Deltahat, n, kappa)# Compute probability of similarity p3 <-pp(-0.15, Deltahat, n, kappa) -pp( 0.15, Deltahat, n, kappa)# For each decision find first sample size at which criterion was satisfied k <- n > ignore ineff <-which((p2 <0.025))[1] eff <-which(k & (p1 >0.95& p2 >0.85))[1] sim <-which(k & (p3 >0.95))[1] u <-function(r, n) list(conclusion=r, n=n)# If not stopped, trial went to maximum All <-c(ineff, eff, sim)if(all(is.na(All))) return(u('none', max(n))) first <-min(All, na.rm=TRUE)if(!is.na(ineff) && ineff == first) return(u('inefficacy', ineff))if(!is.na(eff) && eff == first) return(u('non-trivial efficacy', eff))u('similarity', sim)}```Run all 50,000 trials. For each $\Delta$ and $\kappa$ simulate one clinical trial. Do this over a very fine grid of $\Delta$ so that we can later smooth the individual results and do not need multiple trials for any one value of $\Delta$. $\Delta$s are drawn from a mean zero normal distribution with $\Pr(\Delta > 0.75) = 0.025$.```{r}Deltas <-qnorm(seq(0.0001, 0.9999, length=10000), sd=psd(0.75, 0.025))kappas <-c(0.2, 0.1, 0.05, 0.025, 0.01, 0.001)set.seed(3)d <-expand.grid(Delta=Deltas, kappa=kappas)setDT(d)d[, .q(conclusion, n) :=sim1(Delta, kappa), by=.I] # 7 sec.```Here are the analysis $\Delta$ priors. The distribution of the universe of efficacies used to generate the clinical trials has $\kappa=0.025$.```{r}#| fig-height: 5#| fig-width: 7x <-seq(-2.5, 2.5, length=300)w <-expand.grid(Delta=x, kappa=kappas)setDT(w)w[, sigma :=psd(0.75, kappa)]w[, density :=dnorm(Delta, sd=sigma)]ggplot(w, aes(x=Delta, y=density,col=paste0(kappa, ' (', round(sigma, 2), ')'))) +geom_line() +xlab(expression(Delta)) +ylab('') +guides(color=guide_legend(title=expression(kappa(sigma))))```For each $\kappa$ show the frequencies of stopping for the various conclusions.```{r}w <- d[, table(conclusion, kappa)]wR$propconc <-round(prop.table(w, margin=2), 3)```The prior skepticism $\kappa$ has greater impact on stopping for similarity and stopping with no conclusion.Examine the mean and median stopping time (final N) by reason for stopping and $\kappa$.```{r}qmean <-function(x) { qu <-round(quantile(x, (1:3)/4))list(Mean =round(mean(x)),Q1 = qu[1],Median = qu[2],Q3 = qu[3])}w <- d[, qmean(n), by=.(conclusion, kappa)]options(datatable.print.class=FALSE)print(w, nrows=100)R$stopn <- w```Even with the most skeptical prior of the four priors, more than half of the simulated studies which ultimately stopped for efficacy could be stopped with that conclusion after just `r R$Np` participants, the minimum required for precision. We also need to examine the stopping times counting the other reasons. Here are the mean and median number of participants when stopping is for any reason (including going to the maximum sample size of 750).```{r meanns}w <- d[, qmean(n), by=.(kappa)]wR$stopany <- w```For planning purposes the expected sample size may be taken as `r if(length(R$stopany)) R$stopany[kappa==0.001, Mean]`. Sequential analysis without the traditional frequentist limitations on the number of looks makes a large difference in trial efficiency while providing evidence in more dimensions.Now consider the most important aspect of the performance of the design: the primary Bayesian operating characteristic---correctness of decisions about treatment effect. For our simulation study we judge the alignment between the decision about the treatment effect and the (unknown to us and to the sequential analyses but known to the simulation code) true treatment effect that was in play when the data were simulated. So for each prior and each decision that was made, we estimate the distribution of true treatment effects that were in effect when combining all the simulated trials resulting in that decision. This is the distribution of the $\Delta$ values conditional on stopping (or going to the sample size limit of 750 for the no-conclusion simulations) and on the reason for stopping. Keep in mind that for the simulations $\Delta$s were drawn from $\kappa=0.025$.Examine distributions of underlying true $\Delta$ for each prior and each conclusion (reason for stopping).```{r deltadens}#| fig-width: 7#| fig-height: 5.5del <- c(-0.1, 0, 0.1, 0.3)ggplot(d, aes(x=Delta, color=conclusion)) + geom_density() + facet_wrap(~ kappa, ncol=2) + scale_x_continuous(breaks=seq(-1, 1, by=0.5), minor_breaks=seq(-1, 1, by=0.1)) + geom_vline(xintercept=del, col='blue', alpha=0.2) + xlab(expression(Delta)) + xlim(-1, 1) + theme(legend.position='bottom')``````{r}#| fig-height: 3.5sp <- d[, spikecomp(Delta, tresult='segments'), by=.(conclusion, kappa)]ggplot(sp) +geom_segment(aes(x=x, y=y1, xend=x, yend=y2, alpha=I(0.3))) +scale_y_continuous(breaks=NULL, labels=NULL) +ylab('') +facet_grid(conclusion ~ kappa) +geom_vline(xintercept=del, col='red', alpha=0.2) +xlab(expression(Delta))```For a conclusion of non-trivial efficacy, the true effect in play was rarely > 0.2. Let's look at the actual proportions of true $\Delta$s in various intervals vs. the conclusion made.```{r accuracy}z <- cube(d, .('Δ<0' = sum(Delta < 0), 'Δ<0.1' = sum(Delta < 0.1), '-0.15<Δ<0.15' = sum(Delta >= -0.15 & Delta <= 0.15), 'Δ>0.1' = sum(Delta > 0.1), 'Δ>0' = sum(Delta > 0)), by=.q(kappa, conclusion))zR$ncorrect <- zbar <- function(x) round(mean(x), 2)z <- cube(d, .('Δ<0' = bar(Delta < 0), 'Δ<0.1' = bar(Delta < 0.1), '-0.15<Δ<0.15' = bar(Delta >= -0.15 & Delta <= 0.15), 'Δ>0.1' = bar(Delta > 0.1), 'Δ>0' = bar(Delta > 0)), by=.q(kappa, conclusion))R$correct <- zg <- function(w, kappa) { u <- copy(w) if(is.na(kappa)) kappa <- 'All' vline <- paste(rep('-', 60), collapse='') cat('\n', vline, '\nκ=', kappa, '\n', sep='') u[, conclusion := ifelse(is.na(conclusion), 'All', conclusion)] print(u) invisible()}u <- z[, g(.SD, kappa), by=kappa]z <- z[kappa == 0.2]nte.cor <- z[conclusion == 'non-trivial efficacy', 6]nte.any <- z[conclusion == 'non-trivial efficacy', 7]ineff.any <- z[conclusion == 'inefficacy', 3]ineff.cor <- z[conclusion == 'inefficacy', 4]noconc <- R$propconc['none', '0.001']```The least favorable result is obtained when $\kappa=0.2$ when the analysis prior is significantly less skeptical than the way the $\Delta$s were generated.To highlight two important points from that first part of the table, for trials that were stopped with a conclusion of non-trivial efficacy, `r nte.cor` of such trials were truly generated by $\Delta > 0.1$. But `r nte.any` had any efficacy ($\Delta > 0$). For trials stopped for inefficacy, `r ineff.any` of them were generated by a harmful treatment. `r ineff.cor` of them were generated by a harmful or at most only trivially effective treatment. Inconclusive studies tended to have treatment effect near $\Delta = 0.1$. Only about `r noconc` of the simulated trials resulted in no conclusion when $\kappa=0.001$, i.e., wasted resources by randomizing 750 participants to each treatment arm only to have no conclusion at all.To better understand the "no conclusion" results here are the quartiles and 0.1 and 0.9 quantiles of true $\Delta$s in effect when trials went to the maximum sample size without stopping.```{r}g <-function(x) { z <-quantile(x, c(10, 25, 50, 75, 90)/100)as.list(round(z, 2))}d[conclusion =='none', g(Delta), by=kappa]```The median true $\Delta$ when no conclusion could be reached was very near $\gamma = 0.1$, the threshold for triviality of treatment effect. So $\frac{1}{2}$ of the futile trials were generated by trivial efficacy. Very few if any of the trials were generated by a completely ineffective treatment ($\Delta < 0$).Finally, examine the distribution of true $\Delta$ values as a function of deciles of stopping time, when the conclusion was non-trivial efficacy. ```{r}#| fig-height: 3sp <- d[conclusion=='non-trivial efficacy',spikecomp(Delta, tresult='segments'), by=cut2(n, g=10)]ggplot(sp) +geom_segment(aes(x=x, y=y1, xend=x, yend=y2, alpha=I(0.3))) +scale_y_continuous(breaks=NULL, labels=NULL) +ylab('') +facet_wrap(~ cut2) +geom_vline(xintercept=del, col='red', alpha=0.2) +xlab(expression(Delta))```Many studies stopped at the minimum of N=`r R$Np` and these were reliable in judging the treatment to have more than trivial efficacy.`r saveRDS(R, 'design.rds')`## Borrowing Information* When the treatment has been well studied in prior randomized clinical trials it is reasonable to use at least some of that information in the prior distribution for the treatment effect in the new study* Some discounting is required because there may be doubt about + similarity of patients + general time trends in outcomes + changes in concomitant therapies that could detract from efficacy estimated in patients before such therapies were in use + changes in how outcomes are measured including duration of follow-up* Some ideal situations for borrowing of information include + use of a device on a bone that is not the bone previously studied, or use on skin on another part of the body from that studied + studying a device that is a tweak of a device that has been extensively studied previously + use of a drug on a different patient population where the difference in patients is not directly related to what the drug is intended to prevent + use of adult data about a drug to inform efficacy for treating the same disease in children* The posterior probability distribution of the treatment effect from the earlier study/studies becomes part of the prior* Important to include results from randomized studies that showed unimpressive treatment effects along with those that were "positive"; no cherry picking allowed* Discounting can be done a variety of ways, including power priors and mixtures of priors* Mixtures of priors are appealing and more interpretable* Mix a prior $f_1$ that is somewhat skeptical of large treatment effects from the new data alone with an optimistic prior $f_2$ from previous studies* Prior for the new study: $f(\theta) = (1 - a) f_{1}(\theta) + a f_{2}(\theta)$* $a$ may be directly interpreted as the applicability of the previous data to the current comparison* Applicability can be thought of in two ways (see below); for purpose of Bayesian analysis which of the two you prefer does not matter* For either notion of applicability, uncertainties (width of posterior distribution from previous studies) must be carried forward* Applicability is a number between 0 and 1 and is either + The degree to which you think that relative efficacy of treatment as learned from previous trials applies as the relative efficacy in the current trial or + The probability that what is learned about relative efficacy in previous studies fully applies to the new study* To anchor the values: + $a=0 \rightarrow$ previous data should be completely ignored + $a=1 \rightarrow$ previous data are completely relevant and the new study is unnecessary* To justify a value of $a$ one should undertake a survey on a reasonable number (say 12-20) of experts including, for drug studies, clinical pharmacologists familiar with absorption and metabolism of the drug in different patient types* See [this](https://www.fda.gov/media/107649/download) for excellent work the FDA has done related to borrowing adult data for pediatric drug trials**But** it is important to note that strong arguments can be made for including the raw data from other studies in a joint analysis with the new trial data, instead of summarizing the other data through a posterior distribution that is converted to a prior and then down-weighted. Down-weighting can come from adding a bias parameter for the historical data when forming the joint new data/old data Bayesian models. **Major** advantages of this approach include* using a standard simple prior (e.g., Gaussian) for discounting* making assumptions more explicit* explicitly modeling bias* getting more accurate analysis, for example by not assuming normality for a point estimate from historical data* covariate adjustmentThis last point is all-important. If there is a shift in the distribution of important prognostic factors between two data sources, this shift can be handled elegantly and simply using standard covariate adjustment. This is far more effective and efficient than matching or playing with inclusion criteria. Without analyzing the raw data, the effect of covariate shifts is very hard to counter, especially when the outcome model is nonlinear. It is often unclear what part of the covariate distribution (e.g., shift in means vs. a shift in only one of the tails) should be accounted for when using summary and not raw data.A simple case study in joint raw data modeling for incorporating historical control data into an RCT may be found at [fharrell.com/post/hxcontrol](https://fharrell.com/post/hxcontrol). The joint analysis is facilitated by the Bayesian approach, which allows one to specify any number of models for joint analysis. When the models share one or more parameters, interesting things happen. The case study illustrates that* When the new RCT has some but not enough control patients, and control information from historical data is incorporated into the analysis, the prior distribution for the bias in the historical data is important. The amount of bias can be estimated.* When the new RCT has no controls, so that control information is coming only from the historical data, the prior distribution for the bias in the historical data is **incredibly** important. The amount of bias cannot be estimated from the data at hand, and comes solely from the bias parameter's prior distribution. This truly reflects the limitation of not having concurrent controls.### Example Joint Data Model* RCT control: $Y \sim n(\mu_{c}, \sigma)$* RCT active: $Y \sim n(\mu_{a}, \sigma)$* Historical control: $Y \sim n(\mu_{c} + b, \sigma)$* Bias $b \sim n(0, \tau)$* $\tau$ controls amount of borrowing* $\tau = \infty$: historical data irrelevant* $\tau = 0$: historical data trusted as much as within-study control data + Historical data and concurrent controls are estimating the same thing### Extension to Extrapolation* Let $X$ be the extrapolation factor (e.g., age)* At the root is treatment $\times X$ interaction* Model: $Y \sim n(\beta_{0} + \beta_{1}Tx + \beta_{2}X + \beta_{3}\text{Tx} \times X, \sigma)$* No restriction on $\beta_{3} \rightarrow$ new study stands on its own* Skeptical prior on $\beta_{3} \rightarrow$ borrow information* Can also add a nonlinear interaction with a more skeptical prior[fharrell.com/post/ia](https://www.fharrell.com/post/ia) quantifies how non-overlap of $X$ distributions relates to the precision of $X$-specific treatment effects.