An approximate, computationally simple posterior distribution for a single covariate Cox model
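Throughout, the code leans on a few packages and helpers from the author's setup that are not shown in this section. A minimal sketch sufficient to run the chunks below; the definitions of le, %|%, and plotstyle are assumptions inferred from their usage, not the author's originals.

library(dplyr)      # pipes and data manipulation
library(survival)   # Surv, coxph, survSplit, survfit
library(gt)         # display tables
library(brms)       # Stan-based posterior sampling
library(tidybayes)  # spread_draws

# Assumed helpers, inferred from how they are used below:
le <- function(x, name) x[[name]]           # extract a named list element
`%|%` <- function(a, b) paste0(a, b)        # string concatenation
plotstyle <- function(style, ...) par(...)  # stand-in for the author's plot styling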

The data

Consider a time-to-event endpoint with possible censoring. The variable \(\text{event}\) denotes event or censoring status, and the variable \(\text{time}\) denotes the corresponding time. The exposure of primary interest is treatment assignment, which we denote as \(\text{trt}\). A subset of the data might look something like the following.

generate_data <- function(params){
  lro <- params$lro               # true log hazard ratio for treatment
  k <- params$k                   # Weibull shape
  l <- 15/(log(2)^(1/k))          # Weibull scale, chosen so the median is 15 days
  N <- 1200
  
  # Inverse-CDF draw under proportional hazards: S_trt(t) = S_0(t)^exp(lro)
  q2 <- function(y, lro) qweibull(1-(1 - y)^exp(-lro),k,l)
  
  g <- rbinom(N,1,.5)             # 1:1 treatment assignment
  y <- q2(runif(N),0)             # control-arm event times
  y[g==1] <- q2(runif(sum(g==1)), lro)  # treated-arm event times
  e <- rep(0,N)
  e[y<=28] <- 1                   # events observed within 28 days
  y[y>28] <- 28                   # administrative censoring at day 28
  y <- ceiling(y)                 # record time in whole days
  d1 <- data.frame(row = 1:N, time=y, event=e, trt = g)
  params$data <- d1
  params
}

set.seed(1)
d1 <- list(k = 1.8, lro = -.3) %>% 
  generate_data() 
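As a quick sanity check on the generator, the event fraction and randomization balance should land near their design values (roughly 0.88 and 0.5; the exact numbers depend on the seed).

with(d1$data, mean(event))  # fraction with an event by day 28; ~0.88 expected
with(d1$data, mean(trt))    # randomization balance; ~0.5 expected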

d1 %>% 
  le("data") %>% 
  filter(row < 5 | row > 1195) %>% 
  (function(x){ x[5,] <- "..."; x}) %>% 
  gt(auto_align=FALSE) %>% 
  cols_align("right")
 row   time   event   trt
   1     12       1     0
   2     22       1     0
   3      5       1     1
   4      5       1     1
 ...    ...     ...   ...
1197     25       1     1
1198      5       1     1
1199     10       1     0
1200      8       1     0

The primary estimand

To summarize the difference between groups, we use the proportional hazards regression model

\[ \log(\text{relative hazard}) = \beta_1 \text{trt}. \]

The transformed regression parameter \(\exp(\beta_1)\) is the hazard ratio, and it is the primary quantity for inference.

The estimation procedure

As has been described here, here, and here, the Cox model can be estimated with a Poisson generalized linear model. The dataset can be restructured so that the Poisson model generates estimates of the treatment effect identical to those obtained by fitting the Cox model directly. That is, the restructured data combined with the following regression (where \(\alpha_t\) denotes an intercept for each inter-event time interval) will generate identical estimates (and standard errors) of \(\beta_1\).

\[ \log(\lambda) = \alpha_{t} + \beta_1 x \]

Because \(\beta_1\) is the quantity of primary interest, estimating the many intercepts \(\alpha_t\) is a hassle, especially when our end goal is a Bayesian posterior distribution. However, additional manipulation of the dataset admits a much simpler Poisson GLM which yields approximate but highly accurate estimates of \(\beta_1\). The intercept estimates are replaced with a single intercept \(\beta_0\) and an offset term \(\log(z)\).

\[ \log(\lambda) = \beta_0 + \beta_1 x + \log(z) \]

For details about the connection between the Poisson GLM and the Cox model, see the documents linked above. For demonstration purposes, we show the estimates of \(\beta_1\) from the Cox model, the exact Poisson GLM, and the approximate Poisson GLM with offset.

tolong <- function(params){
  # Cut points: the unique event times observed in the data.
  dtimes <- params$data %>% 
    filter(event==1) %>% 
    pull(time) %>% 
    unique %>% 
    sort

  # Expand to one record per subject per inter-event interval at risk.
  data_long <- survSplit(
        formula = Surv(time, event) ~ .
      , data = params$data
      , cut = dtimes
      , episode = "interval"
    ) %>%
    rename(tstop = time) %>% 
    mutate(tduration = tstop - tstart)

  # Center the covariate within each interval ...
  counts <- c(table(data_long$interval))
  xmat <- as.matrix(data_long[,c('trt')])
  centers <- rowsum(xmat, data_long$interval) / counts
  data_long$x <- c(xmat - centers[data_long$interval,])
  # ... and use the interval-specific event rate as the offset term z.
  phat <- with(data_long, tapply(event, interval, sum))/counts
  data_long$z <- c(phat[data_long$interval])

  params$data_long <- data_long
  params
}

# Cox model (Breslow tie handling matches the Poisson equivalence)
m1 <- d1 %>% 
  le("data") %>% 
  coxph(Surv(time,event)~trt, data = ., ties="breslow")

# Exact Poisson equivalent: one intercept per inter-event interval
m2 <- d1 %>% 
  tolong %>% 
  le("data_long") %>% 
  glm(event ~ trt + factor(tstop)-1,data = .,family=poisson)

# Approximate Poisson: single intercept, with the offset log(z)
m3 <- d1 %>% 
  tolong %>% 
  le("data_long") %>% 
  glm(event ~ x + offset(log(z)), family = poisson, data = .)

rbind(
    summary(m1)$coefficients[1,c(1,3)]
  , summary(m2)$coefficients[1,1:2]
  , summary(m3)$coefficients[2,1:2]
) %>% 
  as.data.frame %>% 
  mutate(Model = c("Cox","Poisson (exact)","Poisson (approx)")) %>% relocate(Model) %>% 
  gt() 
Model              coef         se(coef)
Cox                -0.2989313   0.06335280
Poisson (exact)    -0.2989313   0.06335279
Poisson (approx)   -0.2989312   0.06335273

Posterior distribution of the treatment effect

The purpose of this document is to explain the derivation of the posterior distribution for \(\beta_1\) in a Bayesian analysis. The jumping-off point is the Poisson GLM with offset:

\[ Y_i\sim \text{Poisson}(\lambda_i) \]

\[ \log(\lambda_i) = \beta_0 + \beta_1 x_i + \log(z_i) \]

The prior for the treatment effect is potentially informative, with variance \(\theta\). The prior for the intercept is an improper flat prior.

\[ \beta_0 \sim \text{flat prior}\\ \beta_1 \sim N(0,\theta) \]

The log posterior is of the form

\[ \beta_0\sum_i y_i + \beta_1\sum_i y_ix_i - \sum_i e^{\beta_0 + \beta_1x_i + \log(z_i)} - \frac{1}{2\theta}\beta_1^2 + \text{constant}. \]

To integrate out \(\beta_0\), note that the posterior can be expressed as

\[ e^{A\beta_0}e^{-Be^{\beta_0}}C \] with

\[ \begin{aligned} A &= \sum_iy_i\\ B &= \sum_i e^{\beta_1x_i + \log(z_i)}\\ \log C &= \beta_1\sum_i y_ix_i - \frac{1}{2\theta}\beta_1^2 + \text{constant}\\ \end{aligned} \]

and

\[ \int_{-\infty}^{\infty} e^{A\beta_0}e^{-Be^{\beta_0}}C\, d\beta_0 = B^{-A}\Gamma(A)C \]
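This is the standard Gamma integral: substituting \(u = e^{\beta_0}\) turns the left-hand side into \(C\int_0^\infty u^{A-1}e^{-Bu}\,du\). A quick numerical spot check of the identity (without the constant \(C\)) at arbitrary illustrative values:

# Verify the identity at illustrative values (A > 0, B > 0):
A <- 3.5; B <- 2.0
lhs <- integrate(function(b0) exp(A*b0 - B*exp(b0)), -Inf, Inf)$value
rhs <- B^(-A) * gamma(A)
c(lhs = lhs, rhs = rhs)  # the two agree to numerical tolerance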

The log posterior of \(\beta_1\) (denoted \(\log f(\beta_1|x,y)\)) is

\[ -\left[\sum_iy_i\right]\log\left[\sum_ie^{\beta_1x_i + \log(z_i)} \right] + \beta_1\sum_i y_ix_i - \frac{1}{2\theta}\beta_1^2 + \text{new constant} \]

The second approximation

To approximate the posterior with a normal distribution, we use a Taylor series expansion about the posterior mode. Specifically, \(\hat{\mu}\) is the root of

\[ \frac{d}{d\beta_1} \log f(\beta_1|x,y) = 0 \]

and

\[ \hat{\theta} = -\left[\left. \frac{d^2}{d\beta_1^2} \log f(\beta_1|x,y) \right|_{\hat{\mu}}\right]^{-1}. \]

Because the first and second derivatives of the log posterior are readily available, and the root is extremely easy to calculate, the computation time for estimating the posterior distribution is dramatically reduced compared to sampling approaches.
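For reference, writing \(D(\beta_1) = \sum_i z_i e^{\beta_1 x_i}\), direct differentiation of the log posterior above gives

\[ \begin{aligned} \frac{d}{d\beta_1} \log f &= -\left[\sum_i y_i\right]\frac{D'(\beta_1)}{D(\beta_1)} + \sum_i y_ix_i - \frac{\beta_1}{\theta}\\ \frac{d^2}{d\beta_1^2} \log f &= -\left[\sum_i y_i\right]\left[\frac{D''(\beta_1)}{D(\beta_1)} - \left(\frac{D'(\beta_1)}{D(\beta_1)}\right)^2\right] - \frac{1}{\theta} \end{aligned} \]

where \(D'(\beta_1) = \sum_i z_i x_i e^{\beta_1 x_i}\) and \(D''(\beta_1) = \sum_i z_i x_i^2 e^{\beta_1 x_i}\). These are the expressions implemented in getmusd below.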

How good is the approximation?

First, we analyze nested subsets of a single dataset, of increasing size. On the left of each plot, we show the Kaplan-Meier curves for the survival data; on the right, we show the posterior overlaid with the normal approximation. Because our expression for the posterior is only proportional to a density (that is, it isn't normalized to have unit area), we normalize it to have unit mode, and we apply the same normalization to the normal approximation. In this example, a skeptical prior with variance 0.1 is used.

getmusd <- function(theta,y,x,z){
  # Score: first derivative of the log posterior of beta_1; its root is the mode.
  tmp <- function(beta) -sum(y)/sum(exp(beta*x + log(z)))*sum(x*exp(beta*x + log(z))) + sum(y*x) - beta/theta
  mu <- uniroot(tmp,c(-5,5))$root
  sy <- sum(y)
  # Second derivative at the mode; minus its inverse is the approximate variance.
  i1 <- -sy/sum(exp(mu*x + log(z))) * sum(x^2*exp(mu*x+log(z))) + sy*sum(x*exp(mu*x+log(z)))^2/(sum(exp(mu*x+log(z)))^2) - 1/theta
  sd1 <- sqrt(-1/i1)
  c(mu,sd1)
}
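As an illustrative check (not part of the original analysis): with a very diffuse prior, the mode and standard deviation from getmusd should essentially reproduce the point estimate and standard error of the approximate Poisson GLM (m3 above).

dl <- d1 %>% tolong %>% le("data_long")
getmusd(1e6, dl$event, dl$x, dl$z)   # nearly flat prior (theta = 1e6)
summary(m3)$coefficients[2, 1:2]     # compare: coef and se from m3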

# Posterior of beta_1 up to proportionality, normalized to have unit mode.
nlp <- function(beta,theta,y,x,p){
  mu <- getmusd(theta,y,x,p)[1]
  sy <- sum(y)
  syx <- sum(y*x)
  lp <- beta*0
  # Log posterior evaluated at the mode, used for the normalization.
  n1 <- -sy * log(sum(exp(mu*x + log(p)))) + mu*syx - 1/2/theta*mu^2  
  
  for(i in seq_along(beta)){
    lp[i] <- -sy * log(sum(exp(beta[i]*x + log(p)))) + beta[i]*syx - 1/2/theta*beta[i]^2  - n1
  }
  exp(lp)
}

oneplot <- function(params, theta = 0.1){
  par(mfrow = c(1,2), oma = c(0,0,2,0))
  
  fit <- survfit(Surv(time,event) ~ trt, data = params$data)
  plot(fit, col = c("red","blue"), lwd=3, xlab = "Days", ylab = "Survival probability")

  y <- params$data_long$event
  x <- params$data_long$x
  z <- params$data_long$z
  musd <- getmusd(theta,y,x,z)
  bbb <- seq(musd[1]-4*musd[2],musd[1]+4*musd[2],length=100)
  ppp <- nlp(bbb,theta,y,x,z)
  plot(
      bbb
    , ppp
    , xlab = expression(beta[1])
    , ylab = "normalized posterior"
    , pch = 16
    , xlim = c(-1,1.2)
  )
  g <- function(x) {
    dnorm(x,musd[1],musd[2])/dnorm(musd[1],musd[1],musd[2])
  } 
  curve(g,add=TRUE, lwd=3, col = "#00000070")  
  legend(
      "topright"
    , legend = c(as.expression(bquote(.("f(")*beta[1]*.("|")*x*.(",")*y*.(")  "))), expression(N(hat(mu),hat(theta))))
    , pch = c(16, NA)
    , lwd = c(NA,3)
    , col = c("black","#00000070")
    , bty = "n"
  )
  title("N = " %|% max(params$data$row), font.main=1, line = 0, outer = TRUE)
  abline(v=params$lro, col = "navy")
  axis(3,at=params$lro,label = "true log HR")
}

# Subset the simulated trial to its first n subjects.
N <- function(x, n){ x$data <- x$data %>% filter(row <= n); x}

plotstyle(style=upright, cex.axis=1, mar=c(2,2,2,1))

d1 %>% 
  N(5) %>% 
  tolong %>% 
  oneplot

d1 %>% 
  N(25) %>% 
  tolong %>% 
  oneplot

d1 %>% 
  N(125) %>% 
  tolong %>% 
  oneplot

d1 %>% 
  N(625) %>% 
  tolong %>% 
  oneplot

d1 %>% 
  tolong %>% 
  oneplot

Second, we compare the approximate posterior to the brms sampling solution. The figure below is a QQ-type plot: quantiles from the normal approximation are paired with the empirical quantiles of the 4000 posterior draws. The plot is generated for both N=300 and N=1200.

if(!("stan-fit-n300.rds" %in% list.files())){

prior1 <- gsub("PRIOR_SD",sqrt(0.1),"normal(0,PRIOR_SD)")  

m300 <- brm(
  formula = event ~ as.factor(tstop) + trt + offset(log(tduration)),
  family = poisson(),
  prior = set_prior(prior1, class = 'b', coef = "trt"), 
  data = d1 %>% N(300) %>% tolong %>% le("data_long"), 
  chains = 4, cores = 4, iter = 2000, refresh=0, silent=2, open_progress=FALSE)

saveRDS(m300, file = "stan-fit-n300.rds")

m1200 <- brm(
  formula = event ~ as.factor(tstop) + trt + offset(log(tduration)),
  family = poisson(),
  prior = set_prior(prior1, class = 'b', coef = "trt"), 
  data = d1 %>% tolong %>% le("data_long"), 
  chains = 4, cores = 4, iter = 2000, refresh=0, silent=2, open_progress=FALSE)

saveRDS(m1200, file = "stan-fit-n1200.rds")

}else{
  m300 <- readRDS("stan-fit-n300.rds")
  m1200 <- readRDS("stan-fit-n1200.rds")
}

plotstyle(style=upright,mar=c(3,3,1,1),mfrow=c(1,2),mgp=c(2,0.1,0),cex.axis=1)

p1 <- spread_draws(m300, b_trt) %>% pull(b_trt)
d2 <- d1 %>% N(300) %>% tolong %>% le("data_long")
musd1 <- getmusd(.1,d2$event,d2$x,d2$z)
theo1 <- qnorm(ppoints(4000),musd1[1],musd1[2])
emp1 <- sort(p1)
plot(theo1,emp1,xlab = "Normal approximation", ylab = "Sampling")
abline(0,1)
title("N=300", cex.main=1, font.main=1)

p1 <- spread_draws(m1200, b_trt) %>% pull(b_trt)
d2 <- d1 %>% N(1200) %>% tolong %>% le("data_long")
musd1 <- getmusd(.1,d2$event,d2$x,d2$z)
theo1 <- qnorm(ppoints(4000),musd1[1],musd1[2])
emp1 <- sort(p1)
plot(theo1,emp1,xlab = "Normal approximation", ylab = "Sampling")
abline(0,1)
title("N=1200", cex.main=1, font.main=1)

The approximate normal posterior distribution matches extremely well with the posterior distribution generated from sampling.

Third, we compare the posterior probabilities from the approximate solution to those of the sampling method in 25 simulated datasets. Specifically, we calculate \(P(\beta_1 > 0)\).

if(!("post-prob-sim.rds" %in% list.files())){
out <- array(NA,c(25,2))

for(i in 1:nrow(out)){
  show(i)
  
  set.seed(i*100)
  r1 <- list(k = 1.8, lro = -.3) %>% 
    generate_data() 
  
  prior1 <- gsub("PRIOR_SD",sqrt(0.1),"normal(0,PRIOR_SD)")  

  m1 <- brm(
    formula = event ~ as.factor(tstop) + trt + offset(log(tduration)),
    family = poisson(),
    prior = set_prior(prior1, class = 'b', coef = "trt"), 
    data = r1 %>% N(300) %>% tolong %>% le("data_long"), 
    chains = 4, cores = 4, iter = 2000, refresh=0, silent=2, open_progress=FALSE)
  out[i,1] <- spread_draws(m1, b_trt) %>% pull(b_trt) %>% `>`(0) %>% mean
  
  
  r2 <- r1 %>% N(300) %>% tolong %>% le("data_long")
  musd1 <- getmusd(.1,r2$event,r2$x,r2$z)
  out[i,2] <- 1-pnorm(0,musd1[1],musd1[2])
  
  saveRDS(out,"post-prob-sim.rds")
}

}else{
  out <- readRDS("post-prob-sim.rds")
}

plotstyle(style=upright,mar=c(3,3,1,1),mfrow=c(1,2),mgp=c(2,0.1,0),cex.axis=1)
plot(out, xlab = "Sampling", ylab = "Normal Approximation")
abline(0,1)
plot(rowMeans(out),out[,1]-out[,2],xlab = "Means", ylab = "Differences", main = "Bland-Altman", cex.main = 1, font.main = 1)

Calibration of type I error

In a trial with multiple looks, the variance of the treatment effect prior can be selected to achieve a desired type I error rate.
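The simulation behind the calibration figure is not shown here; a minimal sketch of the kind of trial it evaluates, assuming success is declared when the posterior probability of benefit exceeds 0.975 at any look (the threshold and stopping rule are assumptions for illustration, not taken from the source):

# Hypothetical decision rule: declare success at a look if
# P(beta_1 < 0 | data) > 0.975 (benefit = lower hazard on treatment).
one_trial <- function(theta, lro = 0, looks = c(300, 600, 900, 1200)){
  d <- generate_data(list(k = 1.8, lro = lro))
  for(n in looks){
    dl <- d %>% N(n) %>% tolong %>% le("data_long")
    musd <- getmusd(theta, dl$event, dl$x, dl$z)
    if(pnorm(0, musd[1], musd[2]) > 0.975) return(TRUE)
  }
  FALSE
}
# The type I error at a candidate prior variance is then estimated by, e.g.,
# mean(replicate(1000, one_trial(theta = 0.017)))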

Simulated outcome data were generated from a Weibull distribution. Censoring occurred after 28 days. The resulting survival curve for the reference (placebo) arm is depicted in blue in the figure below. In red, we show the survival curves when the hazard ratio is 3/2 or 2/3. The reference survival curve was selected so that approximately 88% of subjects experience an event within 28 days.
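The 88% figure can be checked directly from the Weibull parameters used in generate_data (shape \(k = 1.8\), scale chosen so the median is 15 days):

k <- 1.8; l <- 15/(log(2)^(1/k))
pweibull(28, k, l)  # approximately 0.88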

out <- out2 <- list()
for(j in 1:20){
  out[[j]] <- list(k = 1.8, lro = log(1.5)) %>% 
    generate_data() %>% 
    le("data")  
  out2[[j]] <- list(k = 1.8, lro = -log(1.5)) %>% 
    generate_data() %>% 
    le("data")  
}
o1 <- do.call("rbind",out)
o2 <- do.call("rbind",out2)

plotstyle(style=upright, mar=c(3,3,2,1), cex.axis=1, mgp = c(2.1, 0.1,0))
plot(survfit(Surv(time,event)~1,data=o1 %>% filter(trt==0)),conf.int = FALSE,col="blue",lwd=3)

lines(survfit(Surv(time,event)~1,data=o1 %>% filter(trt==1)),conf.int = FALSE,col="red",lwd=3)

lines(survfit(Surv(time,event)~1,data=o2 %>% filter(trt==1)),conf.int = FALSE,col="red",lwd=3)

title(xlab = "Days", ylab = "Survival probability")
title(main = "Survival curve for simulated data", font.main = 1, cex.main = 1)
legend("topright",c("Reference (placebo)", "HR=3/2 or 2/3"), col = c("blue","red"), lwd=3, bty = "n")

In the figure below, we show the results of a simulation study indicating that a prior variance of 0.017 achieves a type I error rate of 0.05. Note that this simulation study does not reflect a futility rule, which, if implemented, may reduce the type I error rate and allow for a less skeptical prior.

s1 <- readRDS("type-1-error-calibration.rds")
s2 <- readRDS("type-1-error-calibration-1.rds")
s3 <- rbind(s1,s2)
a1 <- s3 %>% 
  group_by(prior) %>% 
  summarise(alpha = mean(result), n = length(result)) 

n2 <- nls(result ~ A/(1+exp(- B - C*log10(prior))), data = s3, start = list(A = 1, B = 1, C = 1))

xxx <- 10^seq(-3,3,length=100)
yyy <- predict(n2, newdata = data.frame(prior = xxx))
plotstyle(style=upright, mar=c(3,3,2,1), cex.axis=1, mgp = c(2.1, 0.1,0))
plot(a1[,1:2], log = "x", cex=2, col="#0047ab70", xlab = "Prior variance", ylab = "Type I error")
lines(xxx,yyy, lwd = 5, col = "#0047ab")
title(main = "Prior calibration for interim looks at\nN=300, 600, 900, and 1200", font.main = 1, cex.main = 1)
v2 <- approxfun(yyy,xxx)(0.05)
abline(v=v2,h=0.05,col="#00000070")

Power

Using the prior variance identified in the calibration plot above, we perform a simulation study to estimate the power of a trial with looks at 300, 600, 900, and 1200 subjects. The minimal detectable difference at 80% power is HR = 1.17 when the data follow the distribution used in this simulation.

p1 <- readRDS("power.rds")
a1 <- p1 %>% 
  group_by(lro) %>% 
  summarise(power = mean(result), n = length(result)) 

g1 <- glm(result ~ poly(lro,3),data = p1,family="binomial")
xxx <- seq(0,max(a1$lro),length=50)
yyy <- predict(g1, newdata = data.frame(lro = xxx),type="response")
plotstyle(style=upright, mar=c(3,3,2,1), cex.axis=1, mgp = c(2.1, 0.1,0))
plot(a1[1:10,c(1:2)],cex=2, col="#0047ab70", xlab = expression(beta[1]), ylab = "Power")
lines(xxx,yyy,lwd = 5, col = "#0047ab")
title(main = "Power for trial with interim looks at\nN=300, 600, 900, and 1200", font.main = 1, cex.main = 1)
p2 <- approxfun(yyy,xxx)(0.8)
abline(h=0.8,v=p2,col="#00000070")
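The root p2 is on the log hazard ratio scale; exponentiating converts it back to the hazard ratio quoted above.

exp(p2)  # minimal detectable hazard ratio at 80% power (about 1.17 per the text)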