Code
require(rms)          # also gets access to Hmisc
require(data.table)   # used to compute various summary measures
require(VGAM)
knitrSet(lang='quarto')VIOLET was a randomized clinical trial of seriously ill adults in ICUs to study the effectiveness of vitamin D vs. placebo. Vitamin D was not effective. VIOLET 2 is a rich dataset having daily ordinal outcomes assessed for 28 days. Due to an apparent anomaly in ventilator use on day 28 we restrict attention to days 1-27. The original NEJM paper focused on 1078 patients with confirmed baseline vitamin D deficiency. The analyses presented here use 1352 of the original 1360 randomized patients. We fit a partial proportional odds Markov transition model to form a simulation model. The original data may be requested here.
We simulate a 250,000 patient clinical trial using a Markov proportional odds ordinal longitudinal model fitted to the VIOLET study, including a limited number of covariates randomly sampled from the study and overriding the treatment effect on transition odds to have an odds ratio of 0.75 (when OR < 1 denotes benefit). The proportional odds assumption is relaxed for the time effect. The first-order Markov model is fitted with a partial proportional odds model using the VGAM package. Simulating a very large number of patients allows users to do repeated sampling as in simulations, and to analyze any ordinary sample size.
Extensive re-analyses of VIOLET may be found at hbiostat.org/proj/covid19.
require(rms)          # also gets access to Hmisc
require(data.table)   # used to compute various summary measures
require(VGAM)
knitrSet(lang='quarto')dcf <- readRDS('dcf.rds')
setDT(dcf, key=c('id', 'day'))   # includes records with death carried forward
dcf <- dcf[! is.na(bstatus) & day < 28,
           .(id, day, bstatus, status, treatment, ddeath, age, lips,
             charlson, sofa, vitD)]
# Since baseline status did not distinguish ARDS from on ventilator,
# we must pool follow-up status likewise
s <- dcf[, bstatus]
levels(s) <- list('Vent/ARDS'='ARDS', 'Vent/ARDS'='On Vent',
                  'In Hospital/Facility'='In Hospital')
dcf[, table(bstatus, s)]             s
bstatus       Vent/ARDS In Hospital/Facility
  ARDS             2727                    0
  On Vent         10071                    0
  In Hospital         0                23679
dcf[, bstatus := s]
d   <- dcf
d[, pstatus := shift(status), by=id]
d[is.na(pstatus), pstatus := bstatus]
# Don't carry death forward when fitting Markov models:
d <- d[day <= ddeath]
setkey(d, id, day)
# Create a new treatment A B variable
d[, tx := c(Placebo='A', 'Vitamin D'='B')[treatment]]
d[, treatment := NULL]html(describe(d[day == 1, ], 'Baseline Data'))| n | missing | distinct | 
|---|---|---|
| 1351 | 0 | 1351 | 
| lowest : | A01-00208 | A01-00210 | A01-00261 | A01-00337 | A01-00354 | 
| highest: | W05-00581 | W05-00589 | W05-00590 | W05-00591 | W05-00592 | 
| n | missing | distinct | Info | Mean | Gmd | 
|---|---|---|---|---|---|
| 1351 | 0 | 1 | 0 | 1 | 0 | 
Value 1 Frequency 1351 Proportion 1
| n | missing | distinct | 
|---|---|---|
| 1351 | 0 | 2 | 
Value Vent/ARDS In Hospital/Facility Frequency 474 877 Proportion 0.351 0.649
| n | missing | distinct | 
|---|---|---|
| 1351 | 0 | 4 | 
 Value                      Dead            Vent/ARDS In Hospital/Facility
 Frequency                    16                  421                  877
 Proportion                0.012                0.312                0.649
                                
 Value                      Home
 Frequency                    37
 Proportion                0.027
 
 | n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1351 | 0 | 28 | 0.375 | 26.23 | 4.925 | 5 | 13 | 29 | 29 | 29 | 29 | 29 | 
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1351 | 0 | 78 | 1 | 56.31 | 18.16 | 27.0 | 33.0 | 46.0 | 58.0 | 67.5 | 76.0 | 82.0 | 
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1351 | 0 | 28 | 0.996 | 5.176 | 3.402 | 0 | 0 | 3 | 5 | 7 | 9 | 10 | 
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1309 | 42 | 17 | 0.987 | 3.822 | 3.245 | 0 | 0 | 2 | 3 | 5 | 8 | 9 | 
 Value          0     1     2     3     4     5     6     7     8     9    10    11
 Frequency    172   154   166   168   170   154    99    74    61    32    23    10
 Proportion 0.131 0.118 0.127 0.128 0.130 0.118 0.076 0.057 0.047 0.024 0.018 0.008
                                         
 Value         12    13    14    15    18
 Frequency     14     6     4     1     1
 Proportion 0.011 0.005 0.003 0.001 0.001
 
 | n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1351 | 0 | 19 | 0.993 | 5.377 | 4.029 | 0 | 1 | 3 | 5 | 8 | 10 | 12 | 
 Value          0     1     2     3     4     5     6     7     8     9    10    11
 Frequency     87   123   112   136   158   134   135   106    89    76    65    54
 Proportion 0.064 0.091 0.083 0.101 0.117 0.099 0.100 0.078 0.066 0.056 0.048 0.040
                                                     
 Value         12    13    14    15    16    17    18
 Frequency     32    17    11     6     5     3     2
 Proportion 0.024 0.013 0.008 0.004 0.004 0.002 0.001
 
 | n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1300 | 51 | 554 | 1 | 13.74 | 8.285 | 3.898 | 5.163 | 8.047 | 12.600 | 17.800 | 23.310 | 27.300 | 
| n | missing | distinct | 
|---|---|---|
| 1351 | 0 | 2 | 
Value Vent/ARDS In Hospital/Facility Frequency 474 877 Proportion 0.351 0.649
| n | missing | distinct | 
|---|---|---|
| 1351 | 0 | 2 | 
Value A B Frequency 666 685 Proportion 0.493 0.507
Fit a partial PO model, with no PO assumption for the time effect. This allows different mixes of outcome categories over time.
f <- vglm(ordered(status) ~ pstatus + lsp(day, 2) + age + sofa + tx,
              cumulative(reverse=TRUE, parallel=FALSE ~ lsp(day, 2)),
              data=d)
se <- sqrt(diag(vcov(f)))
round(cbind(beta=coef(f), se=se, z=coef(f) / se), 3)                               beta    se       z
(Intercept):1                 4.713 0.531   8.869
(Intercept):2                -1.741 0.292  -5.967
(Intercept):3               -10.034 0.361 -27.800
pstatusIn Hospital/Facility   6.364 0.098  64.706
pstatusHome                  15.797 0.229  68.971
lsp(day, 2)day:1             -0.412 0.280  -1.475
lsp(day, 2)day:2              0.481 0.150   3.218
lsp(day, 2)day:3              1.315 0.175   7.533
lsp(day, 2)day':1             0.410 0.283   1.447
lsp(day, 2)day':2            -0.525 0.152  -3.456
lsp(day, 2)day':3            -1.384 0.176  -7.868
age                          -0.011 0.002  -6.777
sofa                         -0.046 0.007  -6.525
txB                           0.028 0.050   0.560
The non-PO part of the model has 6 parameters: 2 time components \(\times\) 3 states corresponding to PO model intercepts. Two time components were required to accommodate the very low number of patients who were sent home on the first day. The linear spline used here essentially grants an exception at \(t=1\).
First we have to decide the covariate distributions to simulate from. Let’s sample with replacement from the whole set of covariates. Fetch baseline records, making sure that the first record for each patient has day=1. Let’s re-randomize patients since we are going to ignore the observed treatment effect anyway. Then sample with replacement for simulating a trial of N = 100,000 patients.
When sampling baseline covariates we must also sample the baseline states.
N <- 250000
w <- d[, .(first=min(day)), by=id]
w[, range(first)][1] 1 1
set.seed(8)
w <- d[day == 1, .(age, sofa, bstatus, tx=sample(0:1, .N, replace=TRUE))]
# Expand to a large number of patients by sampling with replacement
X <- w[, cbind(age, sofa, tx)]
init <- w[, bstatus]
i    <- sample(1 : nrow(X), N, replace=TRUE)
X    <- X[i, ]
init <- init[i]Noting that he vgam function’s non-PO terms are parameterized to pertain to each category after the first (Dead). We write a g function for the state transitions, and save intercepts separately. Then compose a function to simulate one transition for one patient.
k     <- coef(f)
ints  <- k[1 : 3]
extra <- k[- (1 : 3)]
extra <- extra[-length(extra)] # remove tx coefficient
g <- function(yprev, t, gap, X, parameter=0, extra) {
    tau   <- extra[1:2]   # main intercepts
    kappa <- extra[3:8]   # time effects including non-PO
    beta  <- extra[9:10]  # baseline covariates
    lp <- matrix(0., nrow=length(yprev), ncol=3,
                 dimnames=list(as.character(yprev),
                 c('Vent/ARDS', 'In Hospital/Facility', 'Home')))
    tp <- pmax(t - 2, 0)   # second time term, from linear spline knot at t=2
    # 3 columns = no. distinct y less 1 = length of intercepts
    # lp[yp, ] is a 3-vector because the kappa components are split out into a 3-vector
    if(is.matrix(X)) stop('X not set up correctly')
    for(yp in yprev)
      lp[as.character(yp), ] <- tau[1] * (yp == 'In Hospital/Facility') +
                                  tau[2] * (yp == 'Home') +
                                t * kappa[1:3] + tp * kappa[4:6] +
                                beta[1] * X[1] + beta[2] * X[2] +
                                    parameter * X[3]
  lp
}
formals(g)$extra <- extraSimulate 28 consecutive days for each of the N new patients, using the covariate/treatment/initial state combinations we sampled with replacement to expand to N. But first simulate 10,000 patients with death carried forward and compare distributions to observed data. Note that the treatment OR needs to be \(\frac{1}{0.75}\) since outcomes are reverse coded.
n <- 10000
s <- simMarkovOrd(n=n, y=levels(d$status), times=1:28,
                  initial=init[1:n], X=X[1:n,], absorb='Dead',
                  intercepts=ints, g=g, parameter=-log(0.75), carry=TRUE)
propsPO(y      ~ time, data=subset(s, tx == 1)) +
                    labs(caption='Simulated active arm data')
propsPO(y      ~ time, data=subset(s, tx == 0)) +
                    labs(caption='Simulated placebo data')
propsPO(status ~ day,  data=dcf) + labs(caption='Actual data')Take a detour to examine the correlation patterns in these simulated data in comparison to patterns in the real data. Spearman’s rank correlation is used on the 4-level outcome.
# Make short and wide 
w <- dcast(d[, .(id, day, y=as.numeric(status))],
           id ~ day, value.var='y')
r <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs',
         method='spearman')
plotCorrM(r)[[1]] +
  labs(caption='Spearman correlations for observed data') +
  theme(legend.position='none')
# Repeat for simulated data
setDT(s, key=c('id', 'time'))
w <- dcast(s[time < 28, .(id, time, y=as.numeric(y))],
           id ~ time, value.var='y')
rs <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs',
          method='spearman')
plotCorrM(rs)[[1]] + 
  labs(caption='Spearman correlations for simulated data') +
  theme(legend.position='none')
saveRDS(list(r.actual=r, r.simulated=rs),
        'vcorr.rds', compress='xz')Compute mean absolute difference between correlation coefficients as a function of day
ad <- abs(r - rs)
round(apply(ad, 1, mean), 2)   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
0.09 0.05 0.05 0.05 0.04 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 
  17   18   19   20   21   22   23   24   25   26   27 
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 
There is excellent agreement between actual and assumed correlation patterns except for correlations involving day 1.
Simulate the large trial without carrying death forward. Compare coefficients fitted on the first 20,000 patients in the simulated sample to those from fitting the real data.
s <- simMarkovOrd(n=N, y=levels(d$status), times=1:28,
                  initial=init, X=X, absorb='Dead',
                  intercepts=ints, g=g, parameter=-log(0.75))
s$parameter <- NULL
storage.mode(s$id) <- storage.mode(s$time) <- storage.mode(s$gap) <- 'integer'
simlongord  <- s
save(simlongord, file='simlongord.rda', compress='xz')
html(contents(simlongord))| Name | Levels | Storage | 
|---|---|---|
| id | integer | |
| time | integer | |
| gap | integer | |
| yprev | 3 | integer | 
| y | 4 | integer | 
| age | integer | |
| sofa | integer | |
| tx | integer | 
h <- vglm(ordered(y) ~ yprev + lsp(time, 2) + age + sofa + tx,
              cumulative(reverse=TRUE, parallel=FALSE ~ lsp(time, 2)),
              data=subset(s, id <= 20000))
round(cbind(Actual=coef(f), Simulated=coef(h)), 3)                             Actual Simulated
(Intercept):1                 4.713     4.674
(Intercept):2                -1.741    -1.884
(Intercept):3               -10.034   -10.005
pstatusIn Hospital/Facility   6.364     6.354
pstatusHome                  15.797    15.780
lsp(day, 2)day:1             -0.412    -0.398
lsp(day, 2)day:2              0.481     0.548
lsp(day, 2)day:3              1.315     1.295
lsp(day, 2)day':1             0.410     0.393
lsp(day, 2)day':2            -0.525    -0.590
lsp(day, 2)day':3            -1.384    -1.363
age                          -0.011    -0.010
sofa                         -0.046    -0.048
txB                           0.028     0.291
The coefficients have good agreement except for the treatment effect, which we manipulated. The estimate is very close to \(\log(\frac{1}{0.75})\).
The simulated dataset is placed in hbiostat.org/data and can be easily downloaded using the Hmisc package getHdata function. Here’s example code for downloading the data and setting up to analyze a simulated clinical trial of 1000 patients. The outcome, previous state, and time variables are named, respectively, y, yprev, time.
require(Hmisc)
getHdata(simlongord)
s <- subset(simlongord, id <= 500)
Save the first 500 patients permanently.
simlongord500 <- subset(simlongord, id <= 500)
save(simlongord500, file='simlongord500.rda', compress='xz')
# To access the subset:
# getHdata(simlongord500)R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Pop!_OS 22.04 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 attached base packages: [1] splines stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] VGAM_1.1-6 data.table_1.14.2 rms_6.3-1 SparseM_1.81 [5] Hmisc_4.7-1 ggplot2_3.3.5 Formula_1.2-4 survival_3.3-1 [9] lattice_0.20-45To cite R in publications use:
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-1, https://hbiostat.org/R/Hmisc/.
To cite the rms package in publications use:Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.
To cite the VGAM package in publications use:Yee TW (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.
To cite the data.table package in publications use:Dowle M, Srinivasan A (2021). data.table: Extension of 'data.frame'. R package version 1.14.2, https://CRAN.R-project.org/package=data.table.
To cite the ggplot2 package in publications use:Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
To cite the survival package in publications use:Therneau T (2022). A Package for Survival Analysis in R. R package version 3.3-1, https://CRAN.R-project.org/package=survival.