Markov Modeling of Binary Outcomes Assessed at Irregular Times

1 Introduction

The purpose of this investigation is to study whether a first-order Markov binary logistic model can obtain valid estimates when there are missing measurements, with no adjustment other than interacting measurement or gap times with previous state.

require(rms)
require(data.table)
knitrSet(lang='markdown', w=7, h=7, dev='png', fig.path='png/irreg-')
options(prType='html')

2 Simulation Model

Simulate data for 50,000 patients assessed at times 1, 2, …, m where m is a random number between 1 and 20 for each patient. Assume a first-order Markov model with autocorrelation but without an interaction between time and previous state. There is a linear (in the logit) time trend on the outcome.

# Simulate m records for one subject, with covariate value x
# id=numeric ID for subject, y0=initial state 0/1
# a=intercept, b1=x effect (log odds ratio), b2=previous y effect (log OR)
# b3=time effect on 1:m scale
sim <- function(m, id, x, y0, a=0, b1=0.5, b2=0.6, b3=0.05) {
  yprev <- y0
  y <- integer(m)
  r <- qlogis(runif(m))
  for(i in 1 : m) {
    L <- a + b1 * x + b2 * yprev + b3 * i
    yprev <- y[i] <- 1 * (r[i] <= L)
  }
  cbind(id=id, x=x, time=1 : m, yprev=c(y0, y[-m]), y=y)
}

# Create a dataset for 50,000 subjects, half for each of x=0,1 with 
# number of measurements randomly varying from 1 to 20
# All subjects have initial state of 0

n  <- 50000
m  <- sample(1 : 20, n, replace=TRUE)
x  <- sample(0 : 1,  n, replace=TRUE)
y0 <- 0
w  <- matrix(NA, nrow=sum(m), ncol=5)
colnames(w) <- c('id', 'x', 'time', 'yprev', 'y')
ie <- 0
for(i in 1 : n) {
  is <- ie + 1
  ie <- is + m[i] - 1
  y0 <- 0
  w[is : ie, ] <- sim(m[i], i, x[i], y0)
}
w <- as.data.frame(w)
setDT(w, key=c('x', 'id'))

Compute state occupancy probabilities using simple proportions

prop <- w[, .(p = mean(y)), by=.(x, time)]
prop
    x time         p
 1: 0    1 0.5099673
 2: 0    2 0.5934273
 3: 0    3 0.6185800
 4: 0    4 0.6394838
 5: 0    5 0.6512103
 6: 0    6 0.6636740
 7: 0    7 0.6795389
 8: 0    8 0.6874275
 9: 0    9 0.7054971
10: 0   10 0.7175125
11: 0   11 0.7272944
12: 0   12 0.7369718
13: 0   13 0.7305550
14: 0   14 0.7626888
15: 0   15 0.7643346
16: 0   16 0.7714015
17: 0   17 0.7929541
18: 0   18 0.7965608
19: 0   19 0.8040593
20: 0   20 0.8146688
21: 1    1 0.6341199
22: 1    2 0.7220043
23: 1    3 0.7499667
24: 1    4 0.7559482
25: 1    5 0.7719272
26: 1    6 0.7808622
27: 1    7 0.7898352
28: 1    8 0.7971238
29: 1    9 0.8009615
30: 1   10 0.8076699
31: 1   11 0.8204653
32: 1   12 0.8296998
33: 1   13 0.8391637
34: 1   14 0.8390962
35: 1   15 0.8511146
36: 1   16 0.8579080
37: 1   17 0.8580504
38: 1   18 0.8717813
39: 1   19 0.8814038
40: 1   20 0.8985849
    x time         p
prop[, type:='Empirical']

3 Analysis of Regularly-Spaced Points

Fit the model to the complete dataset to make sure the above programming is right. The period of observation varies by patient but within that there are no missing times.

f <- lrm(y ~ x + yprev + time * yprev, data=w)
f

Logistic Regression Model

 lrm(formula = y ~ x + yprev + time * yprev, data = w)
 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 526277 LR χ2 26482.02 R2 0.071 C 0.642
0 146306 d.f. 4 g 0.584 Dxy 0.283
1 379971 Pr(>χ2) <0.0001 gr 1.794 γ 0.290
max |∂log L/∂β| 4×10-10 gp 0.114 τa 0.114
Brier 0.191
β S.E. Wald Z Pr(>|Z|)
Intercept  -0.0023  0.0080 -0.28 0.7775
x   0.5071  0.0064 79.35 <0.0001
yprev   0.5982  0.0112 53.21 <0.0001
time   0.0507  0.0011 45.90 <0.0001
yprev × time  -0.0003  0.0014 -0.23 0.8180

Use the fitted model to estimate the state occupancy probabilities for times 1, 2, …, 20 for x=0,1

# Compute unconditional probabilities
# Output is a matrix with s rows corresponding to t=times(1), times(2), ...
# Each row has columns corresponding to the probabilities of being in each of the states
# times(1) must be 1
# 
# fit is the model fit which must contain the variable named day and the
# previous state, which must be named by the value of pvarname
# data is a data frame or data table that contains covariate settings
# other than day and what's in pvarname.  data has one row.
# times is the vector of days for which to evaluate state occupancy probabilities
# ylevels are the ordered levels of the outcome variable

uprob <- function(fit, data, times, ylevels, pvarname='yprev', absorb=NULL) {
  k <- length(ylevels)
  s <- length(times)
  P <- matrix(NA, nrow=s, ncol=k)
  colnames(P) <- as.character(ylevels)
  rownames(P) <- as.character(times)
  # Never uncondition on initial state
  data$time <- times[1]
  p <- predict(fit, data, type='fitted.ind')
  if(k == 2) p <- c(1. - p, p)   # predict doesn't predict all levels if Y binary
  P[1, ] <- p
  # Matrix of conditional probabilities of Y conditioning on previous time Y
  # Columns = k conditional probabilities conditional on a single previous state
  # Rows    = all possible previous states
  data <- as.list(data)
  data[[pvarname]] <- as.character(setdiff(ylevels, absorb))  # don't request estimates after absorbing state
  edata <- expand.grid(data)
  for(it in 2 : s) {
    edata$time <- times[it]
    pp <- predict(fit, edata, type='fitted.ind')
    if(k == 2) pp <- cbind(1. - pp, pp)
    cp <- if(! length(absorb)) pp
      else
        if(ylevels[1] == absorb) rbind(c(1., rep(0., k - 1)), pp)
      else
        if(ylevels[k] == absorb) rbind(pp, c(rep(0., k - 1), 1.))
    # Compute unconditional probabilities of being in all possible states at current time t
    P[it, ] <- t(cp) %*% P[it - 1, ]
    }
  P
}

dat <- data.frame(yprev=0)
times <- 1 : 20
xlev  <- 0 : 1

U <- NULL
for(x in xlev) {   # separately for treatments
    dat$x <- x
    u <- uprob(f, dat, times, 0:1)
  cat('\nx=', x, '\n\n')
  print(round(u, 3))
  U <- rbind(U, data.frame(x, time=times, p=u[, 2]))
}

x= 0 

       0     1
1  0.488 0.512
2  0.402 0.598
3  0.378 0.622
4  0.363 0.637
5  0.350 0.650
6  0.337 0.663
7  0.324 0.676
8  0.312 0.688
9  0.300 0.700
10 0.288 0.712
11 0.276 0.724
12 0.265 0.735
13 0.254 0.746
14 0.243 0.757
15 0.233 0.767
16 0.223 0.777
17 0.214 0.786
18 0.204 0.796
19 0.195 0.805
20 0.187 0.813

x= 1 

       0     1
1  0.365 0.635
2  0.275 0.725
3  0.255 0.745
4  0.243 0.757
5  0.233 0.767
6  0.223 0.777
7  0.213 0.787
8  0.204 0.796
9  0.195 0.805
10 0.186 0.814
11 0.178 0.822
12 0.170 0.830
13 0.162 0.838
14 0.155 0.845
15 0.148 0.852
16 0.141 0.859
17 0.135 0.865
18 0.129 0.871
19 0.123 0.877
20 0.117 0.883
setDT(U)
U[, type := 'Model']
z <- rbind(prop, U)
setkey(z, type, x, time)
ggplot(z, aes(x=time, y=p, color=factor(x), linetype=type)) + geom_line() +
  xlab('t') + ylab('P(Y=1)') + labs(color='x', linetype='')

4 Analysis of Irregularly-Spaced Points

Take the large simulated dataset used above and randomly sample half of all the person-periods. Recompute the lagged outcome so that it only knows about the most recent outcome that was actually sampled.

ws <- w[runif(nrow(w)) < 0.5, ]
setkey(ws, x, time)
ws[, yprev := shift(y), by=id]
ws[is.na(yprev), yprev := 0]    # all patients had initial state 0
ws[, tprev := shift(time), by=id]
ws[is.na(tprev), tprev := 0]
ws[, tgap := time - tprev]
table(ws$tgap)

     1      2      3      4      5      6      7      8      9     10     11 
144082  65759  29666  13162   5855   2655   1188    502    201     84     49 
    12     13     14 
    15      3      2 
# View(ws[1:1000, ])
g <- lrm(y ~ x + yprev + (rcs(time, 5) + rcs(tgap, 5)) * yprev, data=ws)
g

Logistic Regression Model

 lrm(formula = y ~ x + yprev + (rcs(time, 5) + rcs(tgap, 5)) * 
     yprev, data = ws)
 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 263223 LR χ2 11579.56 R2 0.062 C 0.632
0 73459 d.f. 14 g 0.543 Dxy 0.265
1 189764 Pr(>χ2) <0.0001 gr 1.722 γ 0.268
max |∂log L/∂β| 9×10-11 gp 0.106 τa 0.107
Brier 0.192
β S.E. Wald Z Pr(>|Z|)
Intercept  -0.3738  0.0246 -15.19 <0.0001
x   0.5318  0.0090 59.06 <0.0001
yprev   1.1107  0.0701 15.84 <0.0001
time   0.0438  0.0111 3.96 <0.0001
time’   0.1694  0.1065 1.59 0.1119
time’’  -0.5254  0.3198 -1.64 0.1004
time’’’   0.5791  0.3785 1.53 0.1260
tgap   0.3527  0.0162 21.77 <0.0001
tgap’  -0.3870  0.0239 -16.21 <0.0001
yprev × time   0.0184  0.0252 0.73 0.4647
yprev × time’  -0.2064  0.1846 -1.12 0.2636
yprev × time’’   0.5355  0.5037 1.06 0.2877
yprev × time’’’  -0.4275  0.5258 -0.81 0.4162
yprev × tgap  -0.5429  0.0216 -25.11 <0.0001
yprev × tgap’   0.6199  0.0327 18.96 <0.0001
anova(g)
Wald Statistics for y
χ2 d.f. P
x 3487.68 1 <0.0001
yprev (Factor+Higher Order Factors) 1794.23 7 <0.0001
All Interactions 810.67 6 <0.0001
time (Factor+Higher Order Factors) 2475.28 8 <0.0001
All Interactions 6.86 4 0.1434
Nonlinear (Factor+Higher Order Factors) 8.81 6 0.1847
tgap (Factor+Higher Order Factors) 798.88 4 <0.0001
All Interactions 766.31 2 <0.0001
Nonlinear (Factor+Higher Order Factors) 371.47 2 <0.0001
yprev × time (Factor+Higher Order Factors) 6.86 4 0.1434
Nonlinear 4.03 3 0.2586
Nonlinear Interaction : f(A,B) vs. AB 4.03 3 0.2586
yprev × tgap (Factor+Higher Order Factors) 766.31 2 <0.0001
Nonlinear 359.52 1 <0.0001
Nonlinear Interaction : f(A,B) vs. AB 359.52 1 <0.0001
TOTAL NONLINEAR 395.99 8 <0.0001
TOTAL INTERACTION 810.67 6 <0.0001
TOTAL NONLINEAR + INTERACTION 924.61 10 <0.0001
TOTAL 11015.88 14 <0.0001

Absolute time is important, as it should be, but sensibly does not have a strong interaction with previous state. The gap time is very important, is very nonlinear, and interacts extremely strongly with previous state. From the coefficients one can see that longer gap times discounts the effect of previous state. Let’s plot the estimated effect of gap time, by previous state.

dd <- datadist(ws); options(datadist='dd')
ggplot(Predict(g, tgap, yprev))

Use the fitted model that allows for the previous state’s influence to vary with gap time to get logits of conditional probabilities over uniform times 1, 2, …, 20. This is done for initial state 0 and x=0, 1. The true values are in the last two columns. We estimate the true log odds over this regular grid by getting predictions for a gap of 1.0.

predlogit <- function() {
  v <- NULL
  for(x in 0 : 1) {
    for(t in 1 : 20) {
      d <- data.frame(x=x, yprev=0:1, time=t, tgap=1)
      v <- rbind(v, data.frame(x=x, time=t, yprev=0:1, logit=predict(g, d), type='Model'))
      z <- 0.5 * x + 0.6 * (0 : 1) + 0.05 * t
      v <- rbind(v, data.frame(x=x, time=t, yprev=0:1, logit=z,             type='Actual'))
    }
  }
  v$yprev <- c('Y(t-1)=0', 'Y(t)=1')[v$yprev + 1]
  ggplot(v, aes(x=time, y=logit, color=factor(x), linetype=type)) + geom_line() +
    facet_wrap(~ yprev) +
    xlab('t') + ylab('logit Conditional Probability') + labs(color='x', linetype='')
}
predlogit()

The nonlinear effect of gap time and the interaction between gap time and previous state resulted in correct estimates of all the conditional probabilities. That implies we also get correct estimates of unconditional probabilities. It is important to note two features of the simulated data that made it (perhaps too) ideal:

  • absolute time and gap time were relatively uncorrelated (not collinear)
  • all measurement times were well represented in the data

In other datasets one may have to

  • only include one of absolute and gap times in the model if the two are very collinear
  • not try to estimate unconditional probabilities (state occupancy probabilities) at time that were not observed

The next simulation tests the last point.

5 Estimation at Unobserved Times

Staying with the simulation model where we completely understand the transition probabilities for each of times 1, 2, …, 20, let’s go back to the more complete simulated dataset and remove all times between 8 and 14.

ws <- w[time < 8 | time > 14, ]
setkey(ws, x, time)
ws[, yprev := shift(y), by=id]
ws[is.na(yprev), yprev := 0]    # all patients had initial state 0
ws[, tprev := shift(time), by=id]
ws[is.na(tprev), tprev := 0]
ws[, tgap := time - tprev]
table(ws$tgap)

     1      8 
335805  15140 

For the logistic model, gap time must be treated as linear because it has only two distinct values.

g <- lrm(y ~ x + yprev + (rcs(time, 5) + tgap) * yprev, data=ws)
g

Logistic Regression Model

 lrm(formula = y ~ x + yprev + (rcs(time, 5) + tgap) * yprev, 
     data = ws)
 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 350945 LR χ2 18774.16 R2 0.074 C 0.643
0 105562 d.f. 12 g 0.596 Dxy 0.285
1 245383 Pr(>χ2) <0.0001 gr 1.816 γ 0.295
max |∂log L/∂β| 1×10-9 gp 0.120 τa 0.120
Brier 0.199
β S.E. Wald Z Pr(>|Z|)
Intercept  -0.0582  0.0253 -2.30 0.0214
x   0.5171  0.0077 67.59 <0.0001
yprev   0.5474  0.1582 3.46 0.0005
time   0.0356  0.0185 1.92 0.0552
time’   1.5428  1.1125 1.39 0.1655
time’’  -2.8066  1.9949 -1.41 0.1595
time’’’   1.7665  1.2887 1.37 0.1704
tgap   0.0588  0.0065 9.02 <0.0001
yprev × time   0.0738  0.0829 0.89 0.3736
yprev × time’  -3.4987  3.2713 -1.07 0.2848
yprev × time’’   5.8811  5.3710 1.09 0.2735
yprev × time’’’  -3.0383  2.6736 -1.14 0.2558
yprev × tgap  -0.0792  0.0077 -10.28 <0.0001
anova(g)
Wald Statistics for y
χ2 d.f. P
x 4568.54 1 <0.0001
yprev (Factor+Higher Order Factors) 4439.41 6 <0.0001
All Interactions 125.90 5 <0.0001
time (Factor+Higher Order Factors) 2819.98 8 <0.0001
All Interactions 4.31 4 0.3656
Nonlinear (Factor+Higher Order Factors) 9.86 6 0.1308
tgap (Factor+Higher Order Factors) 106.02 2 <0.0001
All Interactions 105.64 1 <0.0001
yprev × time (Factor+Higher Order Factors) 4.31 4 0.3656
Nonlinear 3.46 3 0.3254
Nonlinear Interaction : f(A,B) vs. AB 3.46 3 0.3254
yprev × tgap (Factor+Higher Order Factors) 105.64 1 <0.0001
TOTAL NONLINEAR 9.86 6 0.1308
TOTAL INTERACTION 125.90 5 <0.0001
TOTAL NONLINEAR + INTERACTION 134.66 8 <0.0001
TOTAL 17500.16 12 <0.0001

As expected, there is a strong interaction between gap and previous state (dictated by the time sampling scheme), and not between absolute time and previous state (dictated by the simulation model).

predlogit()

6 Summary

In a Markov logistic model It is reasonable to account for irregular measurement times and systematic long gaps in measurement times by including gap times and their interactions with previous states in the model. Irregular measurement times do not induce interactions between absolute time and previous state aside from cases in which gaps are collinear with absolute time, e.g. when measurement times are 1, 2, 4, 8, 16, 32, in which case gap times and measurement times are interchangeable and only one needs to be in the model.

7 Resources

8 Computing Environment

 R version 4.0.3 (2020-10-10)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Pop!_OS 20.10
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] data.table_1.13.2 rms_6.1-0         SparseM_1.78      Hmisc_4.4-3      
 [5] ggplot2_3.3.2     Formula_1.2-4     survival_3.2-7    lattice_0.20-41  
 
To cite R in publication use:

R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.