Simulating a Large Clinical Trial From VIOLET

Author

Affiliation

Vanderbilt University Department of Biostatistics

Published

July 6, 2022

Overview

VIOLET was a randomized clinical trial of seriously ill adults in ICUs to study the effectiveness of vitamin D vs. placebo. Vitamin D was not effective. VIOLET 2 is a rich dataset having daily ordinal outcomes assessed for 28 days. Due to an apparent anomaly in ventilator use on day 28 we restrict attention to days 1-27. The original NEJM paper focused on 1078 patients with confirmed baseline vitamin D deficiency. The analyses presented here use 1352 of the original 1360 randomized patients. We fit a partial proportional odds Markov transition model to form a simulation model. The original data may be requested here.

We simulate a 250,000 patient clinical trial using a Markov proportional odds ordinal longitudinal model fitted to the VIOLET study, including a limited number of covariates randomly sampled from the study and overriding the treatment effect on transition odds to have an odds ratio of 0.75 (when OR < 1 denotes benefit). The proportional odds assumption is relaxed for the time effect. The first-order Markov model is fitted with a partial proportional odds model using the VGAM package. Simulating a very large number of patients allows users to do repeated sampling as in simulations, and to analyze any ordinary sample size.

Extensive re-analyses of VIOLET may be found at hbiostat.org/proj/covid19.

Code

require(rms)          # also gets access to Hmisc
require(data.table)   # used to compute various summary measures
require(VGAM)
knitrSet(lang='quarto')

Data Setup and Baseline Descriptive Statistics

Code

dcf <- readRDS('dcf.rds')
setDT(dcf, key=c('id', 'day'))   # includes records with death carried forward
dcf <- dcf[! is.na(bstatus) & day < 28,
           .(id, day, bstatus, status, treatment, ddeath, age, lips,
             charlson, sofa, vitD)]
# Since baseline status did not distinguish ARDS from on ventilator,
# we must pool follow-up status likewise
s <- dcf[, bstatus]
levels(s) <- list('Vent/ARDS'='ARDS', 'Vent/ARDS'='On Vent',
                  'In Hospital/Facility'='In Hospital')
dcf[, table(bstatus, s)]

             s
bstatus       Vent/ARDS In Hospital/Facility
  ARDS             2727                    0
  On Vent         10071                    0
  In Hospital         0                23679

Code

dcf[, bstatus := s]
d   <- dcf
d[, pstatus := shift(status), by=id]
d[is.na(pstatus), pstatus := bstatus]

# Don't carry death forward when fitting Markov models:
d <- d[day <= ddeath]
setkey(d, id, day)
# Create a new treatment A B variable
d[, tx := c(Placebo='A', 'Vitamin D'='B')[treatment]]
d[, treatment := NULL]

Code

html(describe(d[day == 1, ], 'Baseline Data'))

Baseline Data Descriptives

Baseline Data

12 Variables 1351 Observations

id: Reference Id

n	missing	distinct
1351	0	1351

lowest :	A01-00208	A01-00210	A01-00261	A01-00337	A01-00354
highest:	W05-00581	W05-00589	W05-00590	W05-00591	W05-00592

day: Day

n	missing	distinct	Info	Mean	Gmd
1351	0	1	0	1	0

 Value         1
 Frequency  1351
 Proportion    1

bstatus: Baseline Status

n	missing	distinct
1351	0	2

 Value                 Vent/ARDS In Hospital/Facility
 Frequency                   474                  877
 Proportion                0.351                0.649

status

n	missing	distinct
1351	0	4

 Value                      Dead            Vent/ARDS In Hospital/Facility
 Frequency                    16                  421                  877
 Proportion                0.012                0.312                0.649
                                
 Value                      Home
 Frequency                    37
 Proportion                0.027

ddeath: Day of Death (29 if alive)

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1351	0	28	0.375	26.23	4.925	5	13	29	29	29	29	29

lowest : 1 2 3 4 5 , highest: 25 26 27 28 29

age: SCR age enrolled

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1351	0	78	1	56.31	18.16	27.0	33.0	46.0	58.0	67.5	76.0	82.0

lowest : 18 19 20 21 22 , highest: 91 92 93 94 96

lips: Baseline LIPS Score

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1351	0	28	0.996	5.176	3.402	0	0	3	5	7	9	10

lowest : 0.0 1.0 1.5 2.0 2.5 , highest: 12.0 12.5 13.0 13.5 14.5

charlson: Baseline Charlson Score

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1309	42	17	0.987	3.822	3.245	0	0	2	3	5	8	9

lowest : 0 1 2 3 4 , highest: 12 13 14 15 18

 Value          0     1     2     3     4     5     6     7     8     9    10    11
 Frequency    172   154   166   168   170   154    99    74    61    32    23    10
 Proportion 0.131 0.118 0.127 0.128 0.130 0.118 0.076 0.057 0.047 0.024 0.018 0.008
                                         
 Value         12    13    14    15    18
 Frequency     14     6     4     1     1
 Proportion 0.011 0.005 0.003 0.001 0.001

sofa: SOFA Total Score

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1351	0	19	0.993	5.377	4.029	0	1	3	5	8	10	12

lowest : 0 1 2 3 4 , highest: 14 15 16 17 18

 Value          0     1     2     3     4     5     6     7     8     9    10    11
 Frequency     87   123   112   136   158   134   135   106    89    76    65    54
 Proportion 0.064 0.091 0.083 0.101 0.117 0.099 0.100 0.078 0.066 0.056 0.048 0.040
                                                     
 Value         12    13    14    15    16    17    18
 Frequency     32    17    11     6     5     3     2
 Proportion 0.024 0.013 0.008 0.004 0.004 0.002 0.001

vitD: LCMS Day 0

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1300	51	554	1	13.74	8.285	3.898	5.163	8.047	12.600	17.800	23.310	27.300

lowest : 0.0 0.6 0.8 1.3 1.5 , highest: 48.3 56.2 57.0 67.6 68.0

pstatus

n	missing	distinct
1351	0	2

 Value                 Vent/ARDS In Hospital/Facility
 Frequency                   474                  877
 Proportion                0.351                0.649

n	missing	distinct
1351	0	2

 Value          A     B
 Frequency    666   685
 Proportion 0.493 0.507

Fitting a State Transition Model to VIOLET

Fit a partial PO model, with no PO assumption for the time effect. This allows different mixes of outcome categories over time.

Code

f <- vglm(ordered(status) ~ pstatus + lsp(day, 2) + age + sofa + tx,
              cumulative(reverse=TRUE, parallel=FALSE ~ lsp(day, 2)),
              data=d)
se <- sqrt(diag(vcov(f)))
round(cbind(beta=coef(f), se=se, z=coef(f) / se), 3)

                               beta    se       z
(Intercept):1                 4.713 0.531   8.869
(Intercept):2                -1.741 0.292  -5.967
(Intercept):3               -10.034 0.361 -27.800
pstatusIn Hospital/Facility   6.364 0.098  64.706
pstatusHome                  15.797 0.229  68.971
lsp(day, 2)day:1             -0.412 0.280  -1.475
lsp(day, 2)day:2              0.481 0.150   3.218
lsp(day, 2)day:3              1.315 0.175   7.533
lsp(day, 2)day':1             0.410 0.283   1.447
lsp(day, 2)day':2            -0.525 0.152  -3.456
lsp(day, 2)day':3            -1.384 0.176  -7.868
age                          -0.011 0.002  -6.777
sofa                         -0.046 0.007  -6.525
txB                           0.028 0.050   0.560

The non-PO part of the model has 6 parameters: 2 time components \(\times\) 3 states corresponding to PO model intercepts. Two time components were required to accommodate the very low number of patients who were sent home on the first day. The linear spline used here essentially grants an exception at \(t=1\).

Simulation Model From the Study

First we have to decide the covariate distributions to simulate from. Let’s sample with replacement from the whole set of covariates. Fetch baseline records, making sure that the first record for each patient has day=1. Let’s re-randomize patients since we are going to ignore the observed treatment effect anyway. Then sample with replacement for simulating a trial of N = 100,000 patients.

When sampling baseline covariates we must also sample the baseline states.

Code

N <- 250000
w <- d[, .(first=min(day)), by=id]
w[, range(first)]

[1] 1 1

Code

set.seed(8)
w <- d[day == 1, .(age, sofa, bstatus, tx=sample(0:1, .N, replace=TRUE))]
# Expand to a large number of patients by sampling with replacement
X <- w[, cbind(age, sofa, tx)]
init <- w[, bstatus]
i    <- sample(1 : nrow(X), N, replace=TRUE)
X    <- X[i, ]
init <- init[i]

Noting that he vgam function’s non-PO terms are parameterized to pertain to each category after the first (Dead). We write a g function for the state transitions, and save intercepts separately. Then compose a function to simulate one transition for one patient.

Code

k     <- coef(f)
ints  <- k[1 : 3]
extra <- k[- (1 : 3)]
extra <- extra[-length(extra)] # remove tx coefficient
g <- function(yprev, t, gap, X, parameter=0, extra) {
    tau   <- extra[1:2]   # main intercepts
    kappa <- extra[3:8]   # time effects including non-PO
    beta  <- extra[9:10]  # baseline covariates
    lp <- matrix(0., nrow=length(yprev), ncol=3,
                 dimnames=list(as.character(yprev),
                 c('Vent/ARDS', 'In Hospital/Facility', 'Home')))
    tp <- pmax(t - 2, 0)   # second time term, from linear spline knot at t=2
    # 3 columns = no. distinct y less 1 = length of intercepts
    # lp[yp, ] is a 3-vector because the kappa components are split out into a 3-vector
    if(is.matrix(X)) stop('X not set up correctly')
    for(yp in yprev)
      lp[as.character(yp), ] <- tau[1] * (yp == 'In Hospital/Facility') +
                                  tau[2] * (yp == 'Home') +
                                t * kappa[1:3] + tp * kappa[4:6] +
                                beta[1] * X[1] + beta[2] * X[2] +
                                    parameter * X[3]
  lp
}
formals(g)$extra <- extra

Simulate Data

Simulate 28 consecutive days for each of the N new patients, using the covariate/treatment/initial state combinations we sampled with replacement to expand to N. But first simulate 10,000 patients with death carried forward and compare distributions to observed data. Note that the treatment OR needs to be \(\frac{1}{0.75}\) since outcomes are reverse coded.

Code

n <- 10000
s <- simMarkovOrd(n=n, y=levels(d$status), times=1:28,
                  initial=init[1:n], X=X[1:n,], absorb='Dead',
                  intercepts=ints, g=g, parameter=-log(0.75), carry=TRUE)
propsPO(y      ~ time, data=subset(s, tx == 1)) +
                    labs(caption='Simulated active arm data')
propsPO(y      ~ time, data=subset(s, tx == 0)) +
                    labs(caption='Simulated placebo data')
propsPO(status ~ day,  data=dcf) + labs(caption='Actual data')

Take a detour to examine the correlation patterns in these simulated data in comparison to patterns in the real data. Spearman’s rank correlation is used on the 4-level outcome.

Code

# Make short and wide 
w <- dcast(d[, .(id, day, y=as.numeric(status))],
           id ~ day, value.var='y')
r <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs',
         method='spearman')
plotCorrM(r)[[1]] +
  labs(caption='Spearman correlations for observed data') +
  theme(legend.position='none')
# Repeat for simulated data
setDT(s, key=c('id', 'time'))
w <- dcast(s[time < 28, .(id, time, y=as.numeric(y))],
           id ~ time, value.var='y')
rs <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs',
          method='spearman')
plotCorrM(rs)[[1]] + 
  labs(caption='Spearman correlations for simulated data') +
  theme(legend.position='none')
saveRDS(list(r.actual=r, r.simulated=rs),
        'vcorr.rds', compress='xz')

Compute mean absolute difference between correlation coefficients as a function of day

Code

ad <- abs(r - rs)
round(apply(ad, 1, mean), 2)

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
0.09 0.05 0.05 0.05 0.04 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 
  17   18   19   20   21   22   23   24   25   26   27 
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02

There is excellent agreement between actual and assumed correlation patterns except for correlations involving day 1.

Simulate the large trial without carrying death forward. Compare coefficients fitted on the first 20,000 patients in the simulated sample to those from fitting the real data.

Code

s <- simMarkovOrd(n=N, y=levels(d$status), times=1:28,
                  initial=init, X=X, absorb='Dead',
                  intercepts=ints, g=g, parameter=-log(0.75))
s$parameter <- NULL
storage.mode(s$id) <- storage.mode(s$time) <- storage.mode(s$gap) <- 'integer'

simlongord  <- s
save(simlongord, file='simlongord.rda', compress='xz')
html(contents(simlongord))

simlongord Contents

Data frame:simlongord

6481674 observations and 8 variables, maximum # NAs:0

Name	Levels	Storage
id		integer
time		integer
gap		integer
yprev	3	integer
y	4	integer
age		integer
sofa		integer
tx		integer

Category Levels

yprev

Vent/ARDS

In Hospital/Facility

Home

y

Dead

Vent/ARDS

In Hospital/Facility

Home

Code

h <- vglm(ordered(y) ~ yprev + lsp(time, 2) + age + sofa + tx,
              cumulative(reverse=TRUE, parallel=FALSE ~ lsp(time, 2)),
              data=subset(s, id <= 20000))

round(cbind(Actual=coef(f), Simulated=coef(h)), 3)

                             Actual Simulated
(Intercept):1                 4.713     4.674
(Intercept):2                -1.741    -1.884
(Intercept):3               -10.034   -10.005
pstatusIn Hospital/Facility   6.364     6.354
pstatusHome                  15.797    15.780
lsp(day, 2)day:1             -0.412    -0.398
lsp(day, 2)day:2              0.481     0.548
lsp(day, 2)day:3              1.315     1.295
lsp(day, 2)day':1             0.410     0.393
lsp(day, 2)day':2            -0.525    -0.590
lsp(day, 2)day':3            -1.384    -1.363
age                          -0.011    -0.010
sofa                         -0.046    -0.048
txB                           0.028     0.291

The coefficients have good agreement except for the treatment effect, which we manipulated. The estimate is very close to \(\log(\frac{1}{0.75})\).

Using the Simulated Data

The simulated dataset is placed in hbiostat.org/data and can be easily downloaded using the Hmisc package getHdata function. Here’s example code for downloading the data and setting up to analyze a simulated clinical trial of 1000 patients. The outcome, previous state, and time variables are named, respectively, y, yprev, time.

require(Hmisc)
getHdata(simlongord)
s <- subset(simlongord, id <= 500)

Save the first 500 patients permanently.

Code

simlongord500 <- subset(simlongord, id <= 500)
save(simlongord500, file='simlongord500.rda', compress='xz')
# To access the subset:
# getHdata(simlongord500)

Computing Environment

 R version 4.2.1 (2022-06-23)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Pop!_OS 22.04 LTS
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
 
 attached base packages:
 [1] splines   stats4    stats     graphics  grDevices utils     datasets 
 [8] methods   base     
 
 other attached packages:
 [1] VGAM_1.1-6        data.table_1.14.2 rms_6.3-1         SparseM_1.81     
 [5] Hmisc_4.7-1       ggplot2_3.3.5     Formula_1.2-4     survival_3.3-1   
 [9] lattice_0.20-45

To cite R in publications use:

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-1, https://hbiostat.org/R/Hmisc/.

To cite the rms package in publications use:

Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.

To cite the VGAM package in publications use:

Yee TW (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.

To cite the data.table package in publications use:

Dowle M, Srinivasan A (2021). data.table: Extension of 'data.frame'. R package version 1.14.2, https://CRAN.R-project.org/package=data.table.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

To cite the survival package in publications use:

Therneau T (2022). A Package for Survival Analysis in R. R package version 3.3-1, https://CRAN.R-project.org/package=survival.