Simulating a Large Clinical Trial From VIOLET

Author
Affiliation

Vanderbilt University Department of Biostatistics

Published

July 6, 2022

Overview

VIOLET was a randomized clinical trial of seriously ill adults in ICUs to study the effectiveness of vitamin D vs. placebo. Vitamin D was not effective. VIOLET 2 is a rich dataset having daily ordinal outcomes assessed for 28 days. Due to an apparent anomaly in ventilator use on day 28 we restrict attention to days 1-27. The original NEJM paper focused on 1078 patients with confirmed baseline vitamin D deficiency. The analyses presented here use 1352 of the original 1360 randomized patients. We fit a partial proportional odds Markov transition model to form a simulation model. The original data may be requested here.

We simulate a 250,000 patient clinical trial using a Markov proportional odds ordinal longitudinal model fitted to the VIOLET study, including a limited number of covariates randomly sampled from the study and overriding the treatment effect on transition odds to have an odds ratio of 0.75 (when OR < 1 denotes benefit). The proportional odds assumption is relaxed for the time effect. The first-order Markov model is fitted with a partial proportional odds model using the VGAM package. Simulating a very large number of patients allows users to do repeated sampling as in simulations, and to analyze any ordinary sample size.

Extensive re-analyses of VIOLET may be found at hbiostat.org/proj/covid19.

Code
require(rms)          # also gets access to Hmisc
require(data.table)   # used to compute various summary measures
require(VGAM)
knitrSet(lang='quarto')

Data Setup and Baseline Descriptive Statistics

Code
dcf <- readRDS('dcf.rds')
setDT(dcf, key=c('id', 'day'))   # includes records with death carried forward
dcf <- dcf[! is.na(bstatus) & day < 28,
           .(id, day, bstatus, status, treatment, ddeath, age, lips,
             charlson, sofa, vitD)]
# Since baseline status did not distinguish ARDS from on ventilator,
# we must pool follow-up status likewise
s <- dcf[, bstatus]
levels(s) <- list('Vent/ARDS'='ARDS', 'Vent/ARDS'='On Vent',
                  'In Hospital/Facility'='In Hospital')
dcf[, table(bstatus, s)]
             s
bstatus       Vent/ARDS In Hospital/Facility
  ARDS             2727                    0
  On Vent         10071                    0
  In Hospital         0                23679
Code
dcf[, bstatus := s]
d   <- dcf
d[, pstatus := shift(status), by=id]
d[is.na(pstatus), pstatus := bstatus]

# Don't carry death forward when fitting Markov models:
d <- d[day <= ddeath]
setkey(d, id, day)
# Create a new treatment A B variable
d[, tx := c(Placebo='A', 'Vitamin D'='B')[treatment]]
d[, treatment := NULL]
Code
html(describe(d[day == 1, ], 'Baseline Data'))
Baseline Data Descriptives
Baseline Data

12 Variables   1351 Observations

id: Reference Id
nmissingdistinct
135101351
lowest :A01-00208A01-00210A01-00261A01-00337A01-00354
highest:W05-00581W05-00589W05-00590W05-00591W05-00592

day: Day
nmissingdistinctInfoMeanGmd
135101010
 Value         1
 Frequency  1351
 Proportion    1
 

bstatus: Baseline Status
nmissingdistinct
135102
 Value                 Vent/ARDS In Hospital/Facility
 Frequency                   474                  877
 Proportion                0.351                0.649
 

status
image
nmissingdistinct
135104
 Value                      Dead            Vent/ARDS In Hospital/Facility
 Frequency                    16                  421                  877
 Proportion                0.012                0.312                0.649
                                
 Value                      Home
 Frequency                    37
 Proportion                0.027
 

ddeath: Day of Death (29 if alive)
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
13510280.37526.234.925 5132929292929
lowest : 1 2 3 4 5 , highest: 25 26 27 28 29
age: SCR age enrolled
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
1351078156.3118.1627.033.046.058.067.576.082.0
lowest : 18 19 20 21 22 , highest: 91 92 93 94 96
lips: Baseline LIPS Score
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
13510280.9965.1763.402 0 0 3 5 7 910
lowest : 0.0 1.0 1.5 2.0 2.5 , highest: 12.0 12.5 13.0 13.5 14.5
charlson: Baseline Charlson Score
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
130942170.9873.8223.2450023589
lowest : 0 1 2 3 4 , highest: 12 13 14 15 18
 Value          0     1     2     3     4     5     6     7     8     9    10    11
 Frequency    172   154   166   168   170   154    99    74    61    32    23    10
 Proportion 0.131 0.118 0.127 0.128 0.130 0.118 0.076 0.057 0.047 0.024 0.018 0.008
                                         
 Value         12    13    14    15    18
 Frequency     14     6     4     1     1
 Proportion 0.011 0.005 0.003 0.001 0.001
 

sofa: SOFA Total Score
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
13510190.9935.3774.029 0 1 3 5 81012
lowest : 0 1 2 3 4 , highest: 14 15 16 17 18
 Value          0     1     2     3     4     5     6     7     8     9    10    11
 Frequency     87   123   112   136   158   134   135   106    89    76    65    54
 Proportion 0.064 0.091 0.083 0.101 0.117 0.099 0.100 0.078 0.066 0.056 0.048 0.040
                                                     
 Value         12    13    14    15    16    17    18
 Frequency     32    17    11     6     5     3     2
 Proportion 0.024 0.013 0.008 0.004 0.004 0.002 0.001
 

vitD: LCMS Day 0
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
130051554113.748.285 3.898 5.163 8.04712.60017.80023.31027.300
lowest : 0.0 0.6 0.8 1.3 1.5 , highest: 48.3 56.2 57.0 67.6 68.0
pstatus
nmissingdistinct
135102
 Value                 Vent/ARDS In Hospital/Facility
 Frequency                   474                  877
 Proportion                0.351                0.649
 

tx
nmissingdistinct
135102
 Value          A     B
 Frequency    666   685
 Proportion 0.493 0.507
 

Fitting a State Transition Model to VIOLET

Fit a partial PO model, with no PO assumption for the time effect. This allows different mixes of outcome categories over time.

Code
f <- vglm(ordered(status) ~ pstatus + lsp(day, 2) + age + sofa + tx,
              cumulative(reverse=TRUE, parallel=FALSE ~ lsp(day, 2)),
              data=d)
se <- sqrt(diag(vcov(f)))
round(cbind(beta=coef(f), se=se, z=coef(f) / se), 3)
                               beta    se       z
(Intercept):1                 4.713 0.531   8.869
(Intercept):2                -1.741 0.292  -5.967
(Intercept):3               -10.034 0.361 -27.800
pstatusIn Hospital/Facility   6.364 0.098  64.706
pstatusHome                  15.797 0.229  68.971
lsp(day, 2)day:1             -0.412 0.280  -1.475
lsp(day, 2)day:2              0.481 0.150   3.218
lsp(day, 2)day:3              1.315 0.175   7.533
lsp(day, 2)day':1             0.410 0.283   1.447
lsp(day, 2)day':2            -0.525 0.152  -3.456
lsp(day, 2)day':3            -1.384 0.176  -7.868
age                          -0.011 0.002  -6.777
sofa                         -0.046 0.007  -6.525
txB                           0.028 0.050   0.560

The non-PO part of the model has 6 parameters: 2 time components \(\times\) 3 states corresponding to PO model intercepts. Two time components were required to accommodate the very low number of patients who were sent home on the first day. The linear spline used here essentially grants an exception at \(t=1\).

Simulation Model From the Study

First we have to decide the covariate distributions to simulate from. Let’s sample with replacement from the whole set of covariates. Fetch baseline records, making sure that the first record for each patient has day=1. Let’s re-randomize patients since we are going to ignore the observed treatment effect anyway. Then sample with replacement for simulating a trial of N = 100,000 patients.

When sampling baseline covariates we must also sample the baseline states.

Code
N <- 250000
w <- d[, .(first=min(day)), by=id]
w[, range(first)]
[1] 1 1
Code
set.seed(8)
w <- d[day == 1, .(age, sofa, bstatus, tx=sample(0:1, .N, replace=TRUE))]
# Expand to a large number of patients by sampling with replacement
X <- w[, cbind(age, sofa, tx)]
init <- w[, bstatus]
i    <- sample(1 : nrow(X), N, replace=TRUE)
X    <- X[i, ]
init <- init[i]

Noting that he vgam function’s non-PO terms are parameterized to pertain to each category after the first (Dead). We write a g function for the state transitions, and save intercepts separately. Then compose a function to simulate one transition for one patient.

Code
k     <- coef(f)
ints  <- k[1 : 3]
extra <- k[- (1 : 3)]
extra <- extra[-length(extra)] # remove tx coefficient
g <- function(yprev, t, gap, X, parameter=0, extra) {
    tau   <- extra[1:2]   # main intercepts
    kappa <- extra[3:8]   # time effects including non-PO
    beta  <- extra[9:10]  # baseline covariates
    lp <- matrix(0., nrow=length(yprev), ncol=3,
                 dimnames=list(as.character(yprev),
                 c('Vent/ARDS', 'In Hospital/Facility', 'Home')))
    tp <- pmax(t - 2, 0)   # second time term, from linear spline knot at t=2
    # 3 columns = no. distinct y less 1 = length of intercepts
    # lp[yp, ] is a 3-vector because the kappa components are split out into a 3-vector
    if(is.matrix(X)) stop('X not set up correctly')
    for(yp in yprev)
      lp[as.character(yp), ] <- tau[1] * (yp == 'In Hospital/Facility') +
                                  tau[2] * (yp == 'Home') +
                                t * kappa[1:3] + tp * kappa[4:6] +
                                beta[1] * X[1] + beta[2] * X[2] +
                                    parameter * X[3]
  lp
}
formals(g)$extra <- extra

Simulate Data

Simulate 28 consecutive days for each of the N new patients, using the covariate/treatment/initial state combinations we sampled with replacement to expand to N. But first simulate 10,000 patients with death carried forward and compare distributions to observed data. Note that the treatment OR needs to be \(\frac{1}{0.75}\) since outcomes are reverse coded.

Code
n <- 10000
s <- simMarkovOrd(n=n, y=levels(d$status), times=1:28,
                  initial=init[1:n], X=X[1:n,], absorb='Dead',
                  intercepts=ints, g=g, parameter=-log(0.75), carry=TRUE)
propsPO(y      ~ time, data=subset(s, tx == 1)) +
                    labs(caption='Simulated active arm data')
propsPO(y      ~ time, data=subset(s, tx == 0)) +
                    labs(caption='Simulated placebo data')
propsPO(status ~ day,  data=dcf) + labs(caption='Actual data')

Take a detour to examine the correlation patterns in these simulated data in comparison to patterns in the real data. Spearman’s rank correlation is used on the 4-level outcome.

Code
# Make short and wide 
w <- dcast(d[, .(id, day, y=as.numeric(status))],
           id ~ day, value.var='y')
r <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs',
         method='spearman')
plotCorrM(r)[[1]] +
  labs(caption='Spearman correlations for observed data') +
  theme(legend.position='none')
# Repeat for simulated data
setDT(s, key=c('id', 'time'))
w <- dcast(s[time < 28, .(id, time, y=as.numeric(y))],
           id ~ time, value.var='y')
rs <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs',
          method='spearman')
plotCorrM(rs)[[1]] + 
  labs(caption='Spearman correlations for simulated data') +
  theme(legend.position='none')
saveRDS(list(r.actual=r, r.simulated=rs),
        'vcorr.rds', compress='xz')

Compute mean absolute difference between correlation coefficients as a function of day

Code
ad <- abs(r - rs)
round(apply(ad, 1, mean), 2)
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
0.09 0.05 0.05 0.05 0.04 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 
  17   18   19   20   21   22   23   24   25   26   27 
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 

There is excellent agreement between actual and assumed correlation patterns except for correlations involving day 1.

Simulate the large trial without carrying death forward. Compare coefficients fitted on the first 20,000 patients in the simulated sample to those from fitting the real data.

Code
s <- simMarkovOrd(n=N, y=levels(d$status), times=1:28,
                  initial=init, X=X, absorb='Dead',
                  intercepts=ints, g=g, parameter=-log(0.75))
s$parameter <- NULL
storage.mode(s$id) <- storage.mode(s$time) <- storage.mode(s$gap) <- 'integer'

simlongord  <- s
save(simlongord, file='simlongord.rda', compress='xz')
html(contents(simlongord))
simlongord Contents

Data frame:simlongord

6481674 observations and 8 variables, maximum # NAs:0  
NameLevelsStorage
idinteger
timeinteger
gapinteger
yprev3integer
y4integer
ageinteger
sofainteger
txinteger

Category Levels
yprev
  • Vent/ARDS
  • In Hospital/Facility
  • Home
  • y
  • Dead
  • Vent/ARDS
  • In Hospital/Facility
  • Home
  • Code
    h <- vglm(ordered(y) ~ yprev + lsp(time, 2) + age + sofa + tx,
                  cumulative(reverse=TRUE, parallel=FALSE ~ lsp(time, 2)),
                  data=subset(s, id <= 20000))
    
    round(cbind(Actual=coef(f), Simulated=coef(h)), 3)
                                 Actual Simulated
    (Intercept):1                 4.713     4.674
    (Intercept):2                -1.741    -1.884
    (Intercept):3               -10.034   -10.005
    pstatusIn Hospital/Facility   6.364     6.354
    pstatusHome                  15.797    15.780
    lsp(day, 2)day:1             -0.412    -0.398
    lsp(day, 2)day:2              0.481     0.548
    lsp(day, 2)day:3              1.315     1.295
    lsp(day, 2)day':1             0.410     0.393
    lsp(day, 2)day':2            -0.525    -0.590
    lsp(day, 2)day':3            -1.384    -1.363
    age                          -0.011    -0.010
    sofa                         -0.046    -0.048
    txB                           0.028     0.291

    The coefficients have good agreement except for the treatment effect, which we manipulated. The estimate is very close to \(\log(\frac{1}{0.75})\).

    Using the Simulated Data

    The simulated dataset is placed in hbiostat.org/data and can be easily downloaded using the Hmisc package getHdata function. Here’s example code for downloading the data and setting up to analyze a simulated clinical trial of 1000 patients. The outcome, previous state, and time variables are named, respectively, y, yprev, time.

    require(Hmisc)
    getHdata(simlongord)
    s <- subset(simlongord, id <= 500)

    Save the first 500 patients permanently.

    Code
    simlongord500 <- subset(simlongord, id <= 500)
    save(simlongord500, file='simlongord500.rda', compress='xz')
    # To access the subset:
    # getHdata(simlongord500)

    Computing Environment

     R version 4.2.1 (2022-06-23)
     Platform: x86_64-pc-linux-gnu (64-bit)
     Running under: Pop!_OS 22.04 LTS
     
     Matrix products: default
     BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
     LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
     
     attached base packages:
     [1] splines   stats4    stats     graphics  grDevices utils     datasets 
     [8] methods   base     
     
     other attached packages:
     [1] VGAM_1.1-6        data.table_1.14.2 rms_6.3-1         SparseM_1.81     
     [5] Hmisc_4.7-1       ggplot2_3.3.5     Formula_1.2-4     survival_3.3-1   
     [9] lattice_0.20-45  
     
    To cite R in publications use:

    R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

    To cite the Hmisc package in publications use:

    Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-1, https://hbiostat.org/R/Hmisc/.

    To cite the rms package in publications use:

    Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.

    To cite the VGAM package in publications use:

    Yee TW (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.

    To cite the data.table package in publications use:

    Dowle M, Srinivasan A (2021). data.table: Extension of 'data.frame'. R package version 1.14.2, https://CRAN.R-project.org/package=data.table.

    To cite the ggplot2 package in publications use:

    Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

    To cite the survival package in publications use:

    Therneau T (2022). A Package for Survival Analysis in R. R package version 3.3-1, https://CRAN.R-project.org/package=survival.