Markov Modeling of Binary Outcomes Assessed at Irregular Times
1 Introduction
The purpose of this investigation is to study whether a first-order Markov binary logistic model can obtain valid estimates when there are missing measurements, with no adjustment other than interacting measurement or gap times with previous state.
require(rms)
require(data.table)
knitrSet(lang='markdown', w=7, h=7, dev='png', fig.path='png/irreg-')
options(prType='html')
2 Simulation Model
Simulate data for 50,000 patients assessed at times 1, 2, …, m where m is a random number between 1 and 20 for each patient. Assume a first-order Markov model with autocorrelation but without an interaction between time and previous state. There is a linear (in the logit) time trend on the outcome.
# Simulate m records for one subject, with covariate value x
# id=numeric ID for subject, y0=initial state 0/1
# a=intercept, b1=x effect (log odds ratio), b2=previous y effect (log OR)
# b3=time effect on 1:m scale
<- function(m, id, x, y0, a=0, b1=0.5, b2=0.6, b3=0.05) {
sim <- y0
yprev <- integer(m)
y <- qlogis(runif(m))
r for(i in 1 : m) {
<- a + b1 * x + b2 * yprev + b3 * i
L <- y[i] <- 1 * (r[i] <= L)
yprev
}cbind(id=id, x=x, time=1 : m, yprev=c(y0, y[-m]), y=y)
}
# Create a dataset for 50,000 subjects, half for each of x=0,1 with
# number of measurements randomly varying from 1 to 20
# All subjects have initial state of 0
<- 50000
n <- sample(1 : 20, n, replace=TRUE)
m <- sample(0 : 1, n, replace=TRUE)
x <- 0
y0 <- matrix(NA, nrow=sum(m), ncol=5)
w colnames(w) <- c('id', 'x', 'time', 'yprev', 'y')
<- 0
ie for(i in 1 : n) {
<- ie + 1
is <- is + m[i] - 1
ie <- 0
y0 : ie, ] <- sim(m[i], i, x[i], y0)
w[is
}<- as.data.frame(w)
w setDT(w, key=c('x', 'id'))
Compute state occupancy probabilities using simple proportions
<- w[, .(p = mean(y)), by=.(x, time)]
prop prop
x time p
1: 0 1 0.5099673
2: 0 2 0.5934273
3: 0 3 0.6185800
4: 0 4 0.6394838
5: 0 5 0.6512103
6: 0 6 0.6636740
7: 0 7 0.6795389
8: 0 8 0.6874275
9: 0 9 0.7054971
10: 0 10 0.7175125
11: 0 11 0.7272944
12: 0 12 0.7369718
13: 0 13 0.7305550
14: 0 14 0.7626888
15: 0 15 0.7643346
16: 0 16 0.7714015
17: 0 17 0.7929541
18: 0 18 0.7965608
19: 0 19 0.8040593
20: 0 20 0.8146688
21: 1 1 0.6341199
22: 1 2 0.7220043
23: 1 3 0.7499667
24: 1 4 0.7559482
25: 1 5 0.7719272
26: 1 6 0.7808622
27: 1 7 0.7898352
28: 1 8 0.7971238
29: 1 9 0.8009615
30: 1 10 0.8076699
31: 1 11 0.8204653
32: 1 12 0.8296998
33: 1 13 0.8391637
34: 1 14 0.8390962
35: 1 15 0.8511146
36: 1 16 0.8579080
37: 1 17 0.8580504
38: 1 18 0.8717813
39: 1 19 0.8814038
40: 1 20 0.8985849
x time p
:='Empirical'] prop[, type
3 Analysis of Regularly-Spaced Points
Fit the model to the complete dataset to make sure the above programming is right. The period of observation varies by patient but within that there are no missing times.
<- lrm(y ~ x + yprev + time * yprev, data=w)
f f
Logistic Regression Model
lrm(formula = y ~ x + yprev + time * yprev, data = w)
Model Likelihood Ratio Test |
Discrimination Indexes |
Rank Discrim. Indexes |
|
---|---|---|---|
Obs 526277 | LR χ2 26482.02 | R2 0.071 | C 0.642 |
0 146306 | d.f. 4 | g 0.584 | Dxy 0.283 |
1 379971 | Pr(>χ2) <0.0001 | gr 1.794 | γ 0.290 |
max |∂log L/∂β| 4×10-10 | gp 0.114 | τa 0.114 | |
Brier 0.191 |
β | S.E. | Wald Z | Pr(>|Z|) | |
---|---|---|---|---|
Intercept | -0.0023 | 0.0080 | -0.28 | 0.7775 |
x | 0.5071 | 0.0064 | 79.35 | <0.0001 |
yprev | 0.5982 | 0.0112 | 53.21 | <0.0001 |
time | 0.0507 | 0.0011 | 45.90 | <0.0001 |
yprev × time | -0.0003 | 0.0014 | -0.23 | 0.8180 |
Use the fitted model to estimate the state occupancy probabilities for times 1, 2, …, 20 for x=0,1
# Compute unconditional probabilities
# Output is a matrix with s rows corresponding to t=times(1), times(2), ...
# Each row has columns corresponding to the probabilities of being in each of the states
# times(1) must be 1
#
# fit is the model fit which must contain the variable named day and the
# previous state, which must be named by the value of pvarname
# data is a data frame or data table that contains covariate settings
# other than day and what's in pvarname. data has one row.
# times is the vector of days for which to evaluate state occupancy probabilities
# ylevels are the ordered levels of the outcome variable
<- function(fit, data, times, ylevels, pvarname='yprev', absorb=NULL) {
uprob <- length(ylevels)
k <- length(times)
s <- matrix(NA, nrow=s, ncol=k)
P colnames(P) <- as.character(ylevels)
rownames(P) <- as.character(times)
# Never uncondition on initial state
$time <- times[1]
data<- predict(fit, data, type='fitted.ind')
p if(k == 2) p <- c(1. - p, p) # predict doesn't predict all levels if Y binary
1, ] <- p
P[# Matrix of conditional probabilities of Y conditioning on previous time Y
# Columns = k conditional probabilities conditional on a single previous state
# Rows = all possible previous states
<- as.list(data)
data <- as.character(setdiff(ylevels, absorb)) # don't request estimates after absorbing state
data[[pvarname]] <- expand.grid(data)
edata for(it in 2 : s) {
$time <- times[it]
edata<- predict(fit, edata, type='fitted.ind')
pp if(k == 2) pp <- cbind(1. - pp, pp)
<- if(! length(absorb)) pp
cp else
if(ylevels[1] == absorb) rbind(c(1., rep(0., k - 1)), pp)
else
if(ylevels[k] == absorb) rbind(pp, c(rep(0., k - 1), 1.))
# Compute unconditional probabilities of being in all possible states at current time t
<- t(cp) %*% P[it - 1, ]
P[it, ]
}
P
}
<- data.frame(yprev=0)
dat <- 1 : 20
times <- 0 : 1
xlev
<- NULL
U for(x in xlev) { # separately for treatments
$x <- x
dat<- uprob(f, dat, times, 0:1)
u cat('\nx=', x, '\n\n')
print(round(u, 3))
<- rbind(U, data.frame(x, time=times, p=u[, 2]))
U }
x= 0
0 1
1 0.488 0.512
2 0.402 0.598
3 0.378 0.622
4 0.363 0.637
5 0.350 0.650
6 0.337 0.663
7 0.324 0.676
8 0.312 0.688
9 0.300 0.700
10 0.288 0.712
11 0.276 0.724
12 0.265 0.735
13 0.254 0.746
14 0.243 0.757
15 0.233 0.767
16 0.223 0.777
17 0.214 0.786
18 0.204 0.796
19 0.195 0.805
20 0.187 0.813
x= 1
0 1
1 0.365 0.635
2 0.275 0.725
3 0.255 0.745
4 0.243 0.757
5 0.233 0.767
6 0.223 0.777
7 0.213 0.787
8 0.204 0.796
9 0.195 0.805
10 0.186 0.814
11 0.178 0.822
12 0.170 0.830
13 0.162 0.838
14 0.155 0.845
15 0.148 0.852
16 0.141 0.859
17 0.135 0.865
18 0.129 0.871
19 0.123 0.877
20 0.117 0.883
setDT(U)
:= 'Model']
U[, type <- rbind(prop, U)
z setkey(z, type, x, time)
ggplot(z, aes(x=time, y=p, color=factor(x), linetype=type)) + geom_line() +
xlab('t') + ylab('P(Y=1)') + labs(color='x', linetype='')
4 Analysis of Irregularly-Spaced Points
Take the large simulated dataset used above and randomly sample half of all the person-periods. Recompute the lagged outcome so that it only knows about the most recent outcome that was actually sampled.
<- w[runif(nrow(w)) < 0.5, ]
ws setkey(ws, x, time)
:= shift(y), by=id]
ws[, yprev is.na(yprev), yprev := 0] # all patients had initial state 0
ws[:= shift(time), by=id]
ws[, tprev is.na(tprev), tprev := 0]
ws[:= time - tprev]
ws[, tgap table(ws$tgap)
1 2 3 4 5 6 7 8 9 10 11
144082 65759 29666 13162 5855 2655 1188 502 201 84 49
12 13 14
15 3 2
# View(ws[1:1000, ])
<- lrm(y ~ x + yprev + (rcs(time, 5) + rcs(tgap, 5)) * yprev, data=ws)
g g
Logistic Regression Model
lrm(formula = y ~ x + yprev + (rcs(time, 5) + rcs(tgap, 5)) * yprev, data = ws)
Model Likelihood Ratio Test |
Discrimination Indexes |
Rank Discrim. Indexes |
|
---|---|---|---|
Obs 263223 | LR χ2 11579.56 | R2 0.062 | C 0.632 |
0 73459 | d.f. 14 | g 0.543 | Dxy 0.265 |
1 189764 | Pr(>χ2) <0.0001 | gr 1.722 | γ 0.268 |
max |∂log L/∂β| 9×10-11 | gp 0.106 | τa 0.107 | |
Brier 0.192 |
β | S.E. | Wald Z | Pr(>|Z|) | |
---|---|---|---|---|
Intercept | -0.3738 | 0.0246 | -15.19 | <0.0001 |
x | 0.5318 | 0.0090 | 59.06 | <0.0001 |
yprev | 1.1107 | 0.0701 | 15.84 | <0.0001 |
time | 0.0438 | 0.0111 | 3.96 | <0.0001 |
time’ | 0.1694 | 0.1065 | 1.59 | 0.1119 |
time’’ | -0.5254 | 0.3198 | -1.64 | 0.1004 |
time’’’ | 0.5791 | 0.3785 | 1.53 | 0.1260 |
tgap | 0.3527 | 0.0162 | 21.77 | <0.0001 |
tgap’ | -0.3870 | 0.0239 | -16.21 | <0.0001 |
yprev × time | 0.0184 | 0.0252 | 0.73 | 0.4647 |
yprev × time’ | -0.2064 | 0.1846 | -1.12 | 0.2636 |
yprev × time’’ | 0.5355 | 0.5037 | 1.06 | 0.2877 |
yprev × time’’’ | -0.4275 | 0.5258 | -0.81 | 0.4162 |
yprev × tgap | -0.5429 | 0.0216 | -25.11 | <0.0001 |
yprev × tgap’ | 0.6199 | 0.0327 | 18.96 | <0.0001 |
anova(g)
Wald Statistics for y
|
|||
χ2 | d.f. | P | |
---|---|---|---|
x | 3487.68 | 1 | <0.0001 |
yprev (Factor+Higher Order Factors) | 1794.23 | 7 | <0.0001 |
All Interactions | 810.67 | 6 | <0.0001 |
time (Factor+Higher Order Factors) | 2475.28 | 8 | <0.0001 |
All Interactions | 6.86 | 4 | 0.1434 |
Nonlinear (Factor+Higher Order Factors) | 8.81 | 6 | 0.1847 |
tgap (Factor+Higher Order Factors) | 798.88 | 4 | <0.0001 |
All Interactions | 766.31 | 2 | <0.0001 |
Nonlinear (Factor+Higher Order Factors) | 371.47 | 2 | <0.0001 |
yprev × time (Factor+Higher Order Factors) | 6.86 | 4 | 0.1434 |
Nonlinear | 4.03 | 3 | 0.2586 |
Nonlinear Interaction : f(A,B) vs. AB | 4.03 | 3 | 0.2586 |
yprev × tgap (Factor+Higher Order Factors) | 766.31 | 2 | <0.0001 |
Nonlinear | 359.52 | 1 | <0.0001 |
Nonlinear Interaction : f(A,B) vs. AB | 359.52 | 1 | <0.0001 |
TOTAL NONLINEAR | 395.99 | 8 | <0.0001 |
TOTAL INTERACTION | 810.67 | 6 | <0.0001 |
TOTAL NONLINEAR + INTERACTION | 924.61 | 10 | <0.0001 |
TOTAL | 11015.88 | 14 | <0.0001 |
Absolute time is important, as it should be, but sensibly does not have a strong interaction with previous state. The gap time is very important, is very nonlinear, and interacts extremely strongly with previous state. From the coefficients one can see that longer gap times discounts the effect of previous state. Let’s plot the estimated effect of gap time, by previous state.
<- datadist(ws); options(datadist='dd')
dd ggplot(Predict(g, tgap, yprev))
Use the fitted model that allows for the previous state’s influence to vary with gap time to get logits of conditional probabilities over uniform times 1, 2, …, 20. This is done for initial state 0 and x=0, 1. The true values are in the last two columns. We estimate the true log odds over this regular grid by getting predictions for a gap of 1.0.
<- function() {
predlogit <- NULL
v for(x in 0 : 1) {
for(t in 1 : 20) {
<- data.frame(x=x, yprev=0:1, time=t, tgap=1)
d <- rbind(v, data.frame(x=x, time=t, yprev=0:1, logit=predict(g, d), type='Model'))
v <- 0.5 * x + 0.6 * (0 : 1) + 0.05 * t
z <- rbind(v, data.frame(x=x, time=t, yprev=0:1, logit=z, type='Actual'))
v
}
}$yprev <- c('Y(t-1)=0', 'Y(t)=1')[v$yprev + 1]
vggplot(v, aes(x=time, y=logit, color=factor(x), linetype=type)) + geom_line() +
facet_wrap(~ yprev) +
xlab('t') + ylab('logit Conditional Probability') + labs(color='x', linetype='')
}predlogit()
The nonlinear effect of gap time and the interaction between gap time and previous state resulted in correct estimates of all the conditional probabilities. That implies we also get correct estimates of unconditional probabilities. It is important to note two features of the simulated data that made it (perhaps too) ideal:
- absolute time and gap time were relatively uncorrelated (not collinear)
- all measurement times were well represented in the data
In other datasets one may have to
- only include one of absolute and gap times in the model if the two are very collinear
- not try to estimate unconditional probabilities (state occupancy probabilities) at time that were not observed
The next simulation tests the last point.
5 Estimation at Unobserved Times
Staying with the simulation model where we completely understand the transition probabilities for each of times 1, 2, …, 20, let’s go back to the more complete simulated dataset and remove all times between 8 and 14.
<- w[time < 8 | time > 14, ]
ws setkey(ws, x, time)
:= shift(y), by=id]
ws[, yprev is.na(yprev), yprev := 0] # all patients had initial state 0
ws[:= shift(time), by=id]
ws[, tprev is.na(tprev), tprev := 0]
ws[:= time - tprev]
ws[, tgap table(ws$tgap)
1 8
335805 15140
For the logistic model, gap time must be treated as linear because it has only two distinct values.
<- lrm(y ~ x + yprev + (rcs(time, 5) + tgap) * yprev, data=ws)
g g
Logistic Regression Model
lrm(formula = y ~ x + yprev + (rcs(time, 5) + tgap) * yprev, data = ws)
Model Likelihood Ratio Test |
Discrimination Indexes |
Rank Discrim. Indexes |
|
---|---|---|---|
Obs 350945 | LR χ2 18774.16 | R2 0.074 | C 0.643 |
0 105562 | d.f. 12 | g 0.596 | Dxy 0.285 |
1 245383 | Pr(>χ2) <0.0001 | gr 1.816 | γ 0.295 |
max |∂log L/∂β| 1×10-9 | gp 0.120 | τa 0.120 | |
Brier 0.199 |
β | S.E. | Wald Z | Pr(>|Z|) | |
---|---|---|---|---|
Intercept | -0.0582 | 0.0253 | -2.30 | 0.0214 |
x | 0.5171 | 0.0077 | 67.59 | <0.0001 |
yprev | 0.5474 | 0.1582 | 3.46 | 0.0005 |
time | 0.0356 | 0.0185 | 1.92 | 0.0552 |
time’ | 1.5428 | 1.1125 | 1.39 | 0.1655 |
time’’ | -2.8066 | 1.9949 | -1.41 | 0.1595 |
time’’’ | 1.7665 | 1.2887 | 1.37 | 0.1704 |
tgap | 0.0588 | 0.0065 | 9.02 | <0.0001 |
yprev × time | 0.0738 | 0.0829 | 0.89 | 0.3736 |
yprev × time’ | -3.4987 | 3.2713 | -1.07 | 0.2848 |
yprev × time’’ | 5.8811 | 5.3710 | 1.09 | 0.2735 |
yprev × time’’’ | -3.0383 | 2.6736 | -1.14 | 0.2558 |
yprev × tgap | -0.0792 | 0.0077 | -10.28 | <0.0001 |
anova(g)
Wald Statistics for y
|
|||
χ2 | d.f. | P | |
---|---|---|---|
x | 4568.54 | 1 | <0.0001 |
yprev (Factor+Higher Order Factors) | 4439.41 | 6 | <0.0001 |
All Interactions | 125.90 | 5 | <0.0001 |
time (Factor+Higher Order Factors) | 2819.98 | 8 | <0.0001 |
All Interactions | 4.31 | 4 | 0.3656 |
Nonlinear (Factor+Higher Order Factors) | 9.86 | 6 | 0.1308 |
tgap (Factor+Higher Order Factors) | 106.02 | 2 | <0.0001 |
All Interactions | 105.64 | 1 | <0.0001 |
yprev × time (Factor+Higher Order Factors) | 4.31 | 4 | 0.3656 |
Nonlinear | 3.46 | 3 | 0.3254 |
Nonlinear Interaction : f(A,B) vs. AB | 3.46 | 3 | 0.3254 |
yprev × tgap (Factor+Higher Order Factors) | 105.64 | 1 | <0.0001 |
TOTAL NONLINEAR | 9.86 | 6 | 0.1308 |
TOTAL INTERACTION | 125.90 | 5 | <0.0001 |
TOTAL NONLINEAR + INTERACTION | 134.66 | 8 | <0.0001 |
TOTAL | 17500.16 | 12 | <0.0001 |
As expected, there is a strong interaction between gap and previous state (dictated by the time sampling scheme), and not between absolute time and previous state (dictated by the simulation model).
predlogit()
6 Summary
In a Markov logistic model It is reasonable to account for irregular measurement times and systematic long gaps in measurement times by including gap times and their interactions with previous states in the model. Irregular measurement times do not induce interactions between absolute time and previous state aside from cases in which gaps are collinear with absolute time, e.g. when measurement times are 1, 2, 4, 8, 16, 32, in which case gap times and measurement times are interchangeable and only one needs to be in the model.
7 Resources
- Papers about Markov longitudinal models
- Detailed case studies especially ORCHID and VIOLET 2
8 Computing Environment
R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Pop!_OS 20.10 Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.13.2 rms_6.1-0 SparseM_1.78 Hmisc_4.4-3 [5] ggplot2_3.3.2 Formula_1.2-4 survival_3.2-7 lattice_0.20-41To cite R in publication use:
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.