Longitudinal Binary Logistic Model vs. Cox Model

Department of Biostatistics
Vanderbilt University School of Medicine
VICTR Research Methods Program

Published: January 13, 2021

require(rms)
require(ggplot2)
options(prType='html')
# Define a function that, given integer-valued event and follow-up
# times, creates multiple records per subject.  Event codes are zero
# for all records before an event occurs; with forward=TRUE the event
# indicator 1 is carried forward to maxt, and with forward=FALSE the
# records stop at the event or censoring time.
# Example
# dt <- c(0, 0, 1, 1, 4, 7, 7)
# e  <- c(0, 1, 0, 1, 0, 0, 1)
# tx <- c('a', 'b', 'a', 'b', 'a', 'b', 'a')
# id <- 1 : length(dt)
# Want 
# dt=0 e=0: ignore          n=0
# dt 0 e 1: dt 0:7 e 1 ... 1     n=8
# dt 1 e 0: dt 0 1 e 0 0    n=dt+1
# dt 1 e 1: dt 0:7 e 0 1 ... 1    n=8
# dt 4 e 0: dt 0 1 2 3 4 e 0 0 0 0 0               n=dt+1
# dt 7 e 0: dt 0 1 2 3 4 5 6 7 e 0 0 0 0 0 0 0 0   n=dt+1
# dt 7 e 1: dt 0 1 2 3 4 5 6 7 e 0 0 0 0 0 0 0 1   n=dt+1=8

expandit <- function(dt, e, maxt=max(dt),
                     forward=TRUE, allcens=TRUE, id=1 : length(dt)) {
    id   <- id    # force evaluation now so the default id uses the original length of dt
    i    <- ! (dt == 0 & e == 0)
    dt   <- dt[i]; e <- e[i]; id <- id[i]
    n    <- length(dt)
    r    <- if(forward) ifelse(e == 0, dt + 1, maxt + 1) else dt + 1
    subs <- rep(1 : n, r)
    idr  <- id[subs]
    R <- sum(r)
    dtr <- evr <- nn <- integer(R)
    j <- 1
    
    for(i in 1 : n) {
        f <- r[i]
        k <- j + f - 1
        z <- 0 : (f - 1)
        # Scalar 1 recycles across all f records when the event is at time 0
        ev <- if(dt[i] == 0 & e[i] == 1) 1
              else if(e[i] == 0) rep(0, f)
              else c(rep(0, dt[i]), rep(1, f - dt[i]))
        dtr[j : k] <- z
        evr[j : k] <- ev
        nn [j : k] <- length(z)
        j <- k + 1
    }
    data.frame(id=idr, dt=dtr, e=evr, n=nn)
}
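As a quick sanity check, run expandit on the example data from the comments above; the n column reproduces the record counts listed there, under both carry-forward and truncation:

dt <- c(0, 0, 1, 1, 4, 7, 7)
e  <- c(0, 1, 0, 1, 0, 0, 1)
expandit(dt, e)                  # events carried forward to maxt = 7
expandit(dt, e, forward=FALSE)   # records truncated at event/censoring time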

1 Introduction

Longitudinal binary or ordinal response modeling is a flexible way to model one-time or recurring events and repeated ordinal responses. In the special case where the response variable for a time interval is binary, the event is a terminating event, and interval event probabilities are small, the odds ratios from a longitudinal binary logistic model are the same as Cox model hazard ratios. This equivalence even extends to time-dependent covariates. Both of these features are demonstrated here.
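The near-equality of odds ratios and hazard ratios for small interval probabilities comes from the odds \(p/(1-p)\) being close to \(p\) when \(p\) is small. A minimal numeric sketch, with made-up interval event probabilities:

p1 <- 0.010                          # interval event probability, treatment a
p2 <- 0.0065                         # interval event probability, treatment b
(p2 / (1 - p2)) / (p1 / (1 - p1))    # odds ratio: 0.6477
p2 / p1                              # risk (discrete hazard) ratio: 0.65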

Efron (1988) developed the logistic regression approach to the Cox proportional hazards model, even including the use of spline functions of time for flexibly modeling the hazard function, and the use of time-dependent covariates. He was motivated by the partial logistic regression of D. R. Cox & Snell (1989), which is related to the partial likelihood of David R. Cox (1972). Efron’s Table 1 data layout is the same as our more general “multiple records per patient” layout when there is only one covariate and it is categorical.

2 Comparison of Cox and Longitudinal Logistic Model

Simulate a two-arm study with exponential survival and a hazard ratio of 0.65. Compare the power of the Cox test with that of a binary logistic model fitted to sequences of event indicators corresponding to time intervals, using two approaches: terminate a subject’s records with a 1 when an event occurs, or repeat the 1 until the maximum follow-up for the study to represent an absorbing state. For the repeated logistic approach, failure time is rounded to the nearest integer on a 0-75 time unit range (equivalent to the nearest 0.2 on a 0-15 range). The sample size is 500 subjects (250 in each treatment group). For the logistic model we model the time effect using a flexible restricted cubic spline function with 5 knots (which requires 4 parameters).

One must decide whether to keep a terminating event as a repeated Y=1 after the event first occurs. In the simulation below we will carry deaths forward to the maximum follow-up time over all subjects, then we will also truncate the records at death or censoring times. You will see that this latter approach is the appropriate one.

To understand what this repeated logistic model is estimating, think of the sequences of zeros and ones over time, and estimate probabilities without borrowing information (smoothing) over time. Consider the strung-out subject records and estimate the probability using just the data from the third time period. If the only covariate is binary, restrict attention for the moment to estimating the probability for one of the groups at the third time point. This probability is estimated by a simple proportion: the number of subjects having an event in the third interval divided by the number followed at least to the start of the third interval. This ratio is an estimate of the discrete hazard function at the third time interval. Returning to the two groups, the group coefficient is a log odds ratio. When the discrete hazard is small, the odds ratio is close to the risk ratio, and here the risk ratio would be the discrete hazard ratio.
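Here is a minimal sketch of that proportion calculation, using expandit on made-up data; the discrete hazard for the third interval is the proportion of events among records with dt == 2:

dt <- c(3, 5, 2, 6); e <- c(1, 0, 1, 0)   # hypothetical failure/censoring times
d  <- expandit(dt, e, forward=FALSE)
with(subset(d, dt == 2), mean(e))         # 1 event among 4 at risk = 0.25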

Going back to consideration of three time periods for subjects in a single group, with no covariates, the estimated cumulative survival probability equals the product of one minus the discrete hazard estimates. If there were no censoring, this product at the third interval equals the crude proportion of subjects surviving the third interval. So stringing out the calculation interval-by-interval is the same as a simple proportion, and the interval-by-interval indicator variables are conditionally independent. Standard errors computed without regard to there being repeated measures per subject will be correct, and no intra-cluster correlation correction is needed. Still, for this report we compare naive and robust standard error estimates to demonstrate this.
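A short sketch of the product-limit identity in the no-censoring case, again with made-up data:

dt <- c(1, 2, 2, 3, 3, 3); e <- rep(1, 6)   # everyone eventually has an event
d  <- expandit(dt, e, forward=FALSE)
h  <- with(d, tapply(e, dt, mean))          # discrete hazard estimate per interval
cumprod(1 - h)                              # crude survival proportions c(6, 5, 3, 0) / 6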

Note: also see Approximating the Cox Model.

dosim <- function(sims, n, basehaz, hr, maxcens=75) {
# Special case: sims=1: returns 3 datasets and 3 fit objects
tx   <- c(rep('a', n / 2), rep('b', n / 2))
id   <- 1 : n
haz <- basehaz * ifelse(tx == 'a', 1, hr)

bcox <- blrm <- blrmt <- zcox <- zlrm <- zlrmt <- vest <- vestnaive <- numeric(sims)
for(i in 1 : sims) {
  cens <- maxcens * runif(n)
  dt   <- -log(runif(n)) / haz
  e    <- ifelse(dt <= cens, 1, 0)
  dt   <- pmin(dt, cens)
  f    <- cph(Surv(dt, e) ~ tx, surv=sims==1, x=sims==1, y=sims==1)
  bcox[i] <- coef(f)
  zcox[i] <- coef(f) / sqrt(vcov(f))
  fcox <- f
  dcox <- data.frame(tx, dt, e)
  
  dt   <- round(dt)
  # See also the survSplit function in the survival package
  d    <- expandit(dt, e, id=id, maxt=15*5)
  d$tx <- tx[d$id]
  knots <- c(2, 5, 10, 25, 55)
  g    <- lrm(e ~ tx + rcs(dt, knots), data=d, x=TRUE, y=TRUE, tol=1e-13, maxit=30)
  h    <- robcov(g, d$id)
  blrm[i]<- coef(h)[2]
  v    <- vcov(h)[2,2]
  zlrm[i] <- coef(h)[2] / sqrt(v)
  flrm  <- h
  dlrm  <- d
  
  # Repeat with records truncated at event
  d    <- expandit(dt, e, id=id, maxt=15*5, forward=FALSE)
  d$tx <- tx[d$id]
  g    <- lrm(e ~ tx + rcs(dt, knots), data=d, x=TRUE, y=TRUE, tol=1e-13, maxit=30)
  vestnaive[i] <- vcov(g)[2, 2]
  h    <- robcov(g, d$id)
  blrmt[i] <- coef(h)[2]
  vest[i]  <- vcov(h)[2, 2]
  z    <- coef(h)[2] / sqrt(vest[i])
  zlrmt[i] <- z
  flrmt    <- h
  dlrmt    <- d
  if(sims == 1) return(llist(dcox, dlrm, dlrmt, fcox, flrm, flrmt))
}
bc <- round(c(cox = mean(bcox), lrm = mean(blrm),
              lrmtrunc = mean(blrmt)), 3)

llist(zcox, zlrm, zlrmt, bc, rmnaive=sqrt(mean(vestnaive)),
      rmrobust=sqrt(mean(vest)), sdsim=sd(blrmt))
}

plotit <- function(Z) {
  with(Z, {
    cat('Average regression coefficients:\n')
    print(bc)
    cat('Square root of average naive variance for blrmt:', rmnaive,
    '\nSquare root of average cluster variance estimate for blrmt:', rmrobust,
    '\nStandard deviation of simulated blrmt:', sdsim, '\n')
    
    xl <- 'Z Statistic from Cox Model'
    plot(zcox, zlrm, xlab=xl, ylab='Z from lrm\nevent repeated',main='')
    abline(a=0, b=1, col=gray(.8))
    plot(zcox, zlrmt, xlab=xl, ylab='Z from lrm\nevent not repeated',main='')
    abline(a=0, b=1, col=gray(.8))
    })
}

First simulate one large dataset (30,000 patients) and show predictions from 3 methods. In the graphs below, 1-unit time interval event probabilities are estimated for the truncated repeated logistic model, and cumulative incidence is later computed from them.

set.seed(2)
w <- dosim(1, 30000, basehaz=0.004, hr=0.65)
fcox <- w$fcox
dd <- datadist(w$dcox); options(datadist='dd')
survplot(fcox, tx, fun=function(y) 1 - y)
# Add theoretical curves
x <- seq(0, 75, by=1)
cia <- function(t) 1. - exp(-.004 * t)
cib <- function(t) 1. - exp(-.004 * 0.65 * t)
lines(x, cia(x), col=gray(0.85))
lines(x, cib(x), col=gray(0.85))

# Estimate cumulative survival at integer points and use these
# to get discrete hazards.  Do some smoothing.
as <- as.vector(survest(fcox, data.frame(tx = 'a'), times=x)$surv)
bs <- as.vector(survest(fcox, data.frame(tx = 'b'), times=x)$surv)
i  <- - length(as)        # negative index drops the last element
ah <- - diff(as) / as[i]  # discrete hazard: interval event prob. / prob. at risk
bh <- - diff(bs) / bs[i]
plot( lowess(x[i], ah), xlab='t', ylab='Discrete Hazard', type='l', ylim=c(0, .005))
lines(lowess(x[i], bh), lwd=2.5)
lines(x[i], ah, col=gray(0.85))
lines(x[i], bh, col=gray(0.85), lwd=2.5)

# Add predicted risks from lrm with truncation
pa <- Predict(w$flrmt, dt=x, tx='a', fun=plogis)$yhat
pb <- Predict(w$flrmt, dt=x, tx='b', fun=plogis)$yhat
lines(x, pa, col='red')
lines(x, pb, col='red', lwd=2.5)

# Add theoretical 1-time-unit discrete hazards
xc <- 1 : 75
pea <- (cia(xc) - cia(xc - 1)) / (1 - cia(xc - 1))
peb <- (cib(xc) - cib(xc - 1)) / (1 - cib(xc - 1))
lines(xc, pea, col='blue')
lines(xc, peb, col='blue', lwd=2.5)

For the simulated study of 30,000 patients, the above graph shows the theoretical one time unit interval discrete hazard function (blue lines), the Cox model estimated discrete hazards (noisy gray scale lines), a loess smoothed version of the latter (black lines), and estimates from the logistic model (red lines). Thinner lines are for treatment A and thick lines for treatment B. See that the longitudinal binary logistic model with truncation of records at the terminating event is estimating the discrete hazard function, i.e., the probability of having an event within a one unit time interval given the subject was event-free at the start of that interval.

Let’s convert the logistic model discrete hazard estimates to cumulative incidence estimates by taking the cumulative product of one minus the discrete hazard, separately by treatment. The theoretical cumulative incidences (blue) are superimposed.

aci <- 1 - cumprod(1 - pa)
bci <- 1 - cumprod(1 - pb)
plot(x, aci, type='l', xlab='t', ylab='Cumulative Incidence')
lines(x, bci, lwd=2.5)
lines(x, cia(x), col='blue')
lines(x, cib(x), col='blue', lwd=2.5)

Now run the simulations for both the null case (HR = 1.0) and non-null case (HR = 0.65).

simdone <- TRUE
sims <- 500
if(! simdone) {
  Znull <- dosim(sims, 500, basehaz=0.004, hr=1.0)
  saveRDS(Znull, 'Znull.rds')
  Znon  <- dosim(sims, 500, basehaz=0.004, hr=0.65)
  saveRDS(Znon, 'Znon.rds')
} else {
  Znull <- readRDS('Znull.rds')
  Znon  <- readRDS('Znon.rds')
}

One can see nearly exact agreement between the Cox model and the truncated longitudinal model (but not the “once dead always dead” model) for the log hazard ratio for treatment B:A, for both the null case (\(\beta=0\)) and the non-null case where the true \(\beta\) coefficient is log(0.65) = -0.43.

Plots for the null case follow.

plotit(Znull)
Average regression coefficients:
bc 
     cox      lrm lrmtrunc 
  -0.002    0.003   -0.002 
Square root of average naive variance for blrmt: 0.2461157 
Square root of average cluster variance estimate for blrmt: 0.2461855 
Standard deviation of simulated blrmt: 0.2441803 

Make plots for the non-null case (HR = 0.65, log(HR) = -0.43).

plotit(Znon)
Average regression coefficients:
bc 
     cox      lrm lrmtrunc 
  -0.430   -0.464   -0.432 
Square root of average naive variance for blrmt: 0.2765214 
Square root of average cluster variance estimate for blrmt: 0.276617 
Standard deviation of simulated blrmt: 0.2953252 

The \(z\)-statistics from the Cox model and the longitudinal logistic model are in almost exact agreement when records are truncated at the observed event or censoring. Not so for the carry-forward logistic model.

3 Time-Dependent Covariate

Let’s find out how the estimation of time-dependent covariate effects compares when using the Cox model with a time-dependent covariate vs. the repeated binary logistic model with the same time dependency. We simulate a two-arm treatment study where each treatment’s survival time distribution is log-normal, with the same \(\sigma\). Under the log-normal model the B:A hazard ratio converges to one over time, as seen below.

haz <- function(t, mu, sigma) {   # hazard function for log-normal distribution
  # density / survival function; equivalent to dnorm(z)/sigma/pnorm(-z)/t
  # with z = (log(t) - mu) / sigma
  dlnorm(t, mu, sigma) / (1 - plnorm(t, mu, sigma))
}
sigma <- 1
mu1 <- 0; mu2 <- 0.5
times <- seq(0.01, 50, length=200)
plot(times, haz(times, mu2, sigma) / haz(times, mu1, sigma), type='l',
     xlab='t', ylab='B:A Hazard Ratio', ylim=c(0,1))
abline(h=1, col=gray(0.85))

Simulate a two-sample problem where \(\sigma\) is the same in both treatment groups but the mean log survival time differs. Censor uniformly between 3 and 5 time units. We fit a Cox model with a time-varying coefficient for treatment. By specifying the correct log hazard ratio function dictated by the log-normal distribution, all we have to check is that the Cox model estimate of the coefficient of this function is close to 1.0.

set.seed(1)
n <- 5000
cens <- runif(n, 3, 5)
t1 <- rlnorm(n / 2, mu1, sigma)
t2 <- rlnorm(n / 2, mu2, sigma)
tx <- c(rep('A', n / 2), rep('B', n / 2))
dt <- c(t1, t2)
event <- ifelse(dt <= cens, 1, 0)
dt <- pmin(dt, cens)
# True log hazard ratio function.  Should have a coefficient of 1.0 in Cox model
lhr <- function(t) log(haz(t, mu2, sigma) / haz(t, mu1, sigma))
# Fit time-dependent Cox model
tdcdone <- TRUE
if(! tdcdone) {
  # coxph fit object is enormous; just save print(fit)
  f <- coxph(Surv(dt, event) ~ tx + tt(tx), tt = function(x, t, ...) (x == 'B') * lhr(t))
  sink('ftdc.txt')
  print(f)
  sink()
}
r <- readLines('ftdc.txt')
cat('', r, sep='\n')

Call:
coxph(formula = Surv(dt, event) ~ tx + tt(tx), tt = function(x, 
    t, ...) (x == "B") * lhr(t))

            coef exp(coef)  se(coef)      z        p
txB    -0.006293  0.993727  0.080089 -0.079    0.937
tt(tx)  0.935785  2.549214  0.151238  6.188 6.11e-10

Likelihood ratio test=270.8  on 2 df, p=< 2.2e-16
n= 5000, number of events= 4284 

Using the same data, expand into binary sequences. Multiply 0-5 event times by 100 and round to get enough resolution. Note that the Cox model handles the time effect by allowing for an underlying nonparametric survival curve. For the logistic model we have to explicitly model the time effect or the interaction with treatment will not mean the same thing. This time effect is modeled very flexibly with a restricted cubic spline in log time.

id   <- 1 : n
m    <- 100
d    <- expandit(round(dt * m), event, id=id, maxt=5 * m, forward=FALSE)
d$tx <- tx[d$id]
# First time interval starts at 0; shift zeros to 0.5 so that log(dt) is defined
d$dt <- with(d, ifelse(dt == 0, 0.5, dt)) / m
d$ia <- with(d, (tx == 'B') * lhr(dt))   # time-dependent covariate
# The following lrm fit takes only 2s even with 851,000 records
# (coxph took 45s, but it didn't round any values)
dd <- datadist(d); options(datadist='dd')
# Allow the time main effect to be flexible: restricted cubic spline in log time
g    <- lrm(e ~ tx + rcs(log(dt), 6) + ia, data=d, x=TRUE, y=TRUE)
g
Logistic Regression Model

lrm(formula = e ~ tx + rcs(log(dt), 6) + ia, data = d, x = TRUE, 
    y = TRUE)

Obs 850727 (846443 zeros, 4284 ones)
Model Likelihood Ratio Test: LR χ² 782.06, d.f. 7, Pr(>χ²) <0.0001
Discrimination Indexes: R² 0.015, R²(7,850727) 0.001, R²(7,12787.3) 0.059, Brier 0.005
Rank Discrim. Indexes: C 0.607, Dxy 0.213, γ 0.226, τa 0.002
max |∂log L/∂β| 2×10⁻⁶

              β       S.E.    Wald Z  Pr(>|Z|)
Intercept  -3.3827   0.1960   -17.26   <0.0001
tx=B       -0.0123   0.0800    -0.15    0.8782
dt          1.0734   0.1080     9.94   <0.0001
dt'        -1.5406   0.2756    -5.59   <0.0001
dt''        5.6940   2.6350     2.16    0.0307
dt'''      -7.6804   8.0387    -0.96    0.3394
dt''''      3.5774  13.3837     0.27    0.7892
ia          0.9275   0.1512     6.14   <0.0001
# Adjust variance-covariance matrix for intra-cluster correlations
h    <- robcov(g, d$id)
h   # tx -0.01 +/- 0.08  tx x log(hr) 0.9275 +/- 0.15
Logistic Regression Model

lrm(formula = e ~ tx + rcs(log(dt), 6) + ia, data = d, x = TRUE, 
    y = TRUE)

Obs 850727 (846443 zeros, 4284 ones); Cluster on d$id; Clusters 5000
Model Likelihood Ratio Test: LR χ² 782.06, d.f. 7, Pr(>χ²) <0.0001
Discrimination Indexes: R² 0.015, R²(7,850727) 0.001, R²(7,12787.3) 0.059, Brier 0.005
Rank Discrim. Indexes: C 0.607, Dxy 0.213, γ 0.226, τa 0.002
max |∂log L/∂β| 2×10⁻⁶

              β       S.E.    Wald Z  Pr(>|Z|)
Intercept  -3.3827   0.1750   -19.33   <0.0001
tx=B       -0.0123   0.0790    -0.16    0.8765
dt          1.0734   0.0947    11.34   <0.0001
dt'        -1.5406   0.2521    -6.11   <0.0001
dt''        5.6940   2.5194     2.26    0.0238
dt'''      -7.6804   7.8488    -0.98    0.3278
dt''''      3.5774  13.1800     0.27    0.7861
ia          0.9275   0.1489     6.23   <0.0001

The coefficient of the log hazard ratio function is very close to that estimated from the Cox model (which used no time binning).

Plot the theoretical hazard function for treatment A and then superimpose the 0.01 time window discrete hazard function. With such small time intervals the discrete hazard function is virtually identical to the instantaneous hazard function.

# Plot the hazard function for tx=A
times <- seq(0, 5, length=200)
plot(times, haz(times, mu1, sigma), type='l', xlab='t', ylab='True Hazard Function')
# Compute probability of an event in each 0.01 time unit interval and divide by the
# probability of survival until the start of the interval to obtain the discrete hazard
ti   <- seq(0.01, 5, by=0.01)
i    <- - length(ti)
Ft   <- plnorm(ti, mu1, sigma)
dhaz <- diff(Ft) / (1 - Ft[i])
# Scale the 0.01-time-unit discrete hazards to a 1-time-unit basis (multiply by m)
lines(ti[i], m * dhaz, col='blue')   # superimposed on continuous hazard function

Now plot the logistic estimated probabilities for treatment A as a function of time, along with the discrete hazard function already shown above but now not divided by the interval width.

k <- data.frame(x=ti[i], y=dhaz)
ggplot(Predict(g, dt, tx='A', ia=0, fun=plogis)) + geom_line(aes(x, y, col=I('blue')), data=k) +
  xlab('t') + ylab('Discrete Hazard')

You can see again that the logistic model probability estimates (black) are discrete hazard estimates (population discrete hazard function is blue).

4 Computing Environment

 R version 4.3.0 (2023-04-21)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Pop!_OS 22.04 LTS
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] ggplot2_3.4.2 rms_6.7-0     Hmisc_5.1-1  
 
To cite R in publications use:

R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2023). Hmisc: Harrell Miscellaneous. R package version 5.1-1, https://hbiostat.org/R/Hmisc/.

To cite the rms package in publications use:

Harrell Jr FE (2023). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.


5 References

Cox, David R. (1972). Regression models and life-tables (with discussion). J Roy Stat Soc B, 34, 187–220.
Cox, D. R., & Snell, E. J. (1989). The Analysis of Binary Data (2nd ed.). Chapman and Hall.
Efron, B. (1988). Logistic regression, survival analysis, and the Kaplan-Meier curve. J Am Stat Assoc, 83, 414–425. https://doi.org/10.1080/01621459.1988.10478612