--- title: Simulating a Large Clinical Trial From VIOLET author: - name: Frank Harrell url: https://hbiostat.org affiliation: Vanderbilt University Department of Biostatistics date: last-modified format: html: self-contained: true anchor-sections: true code-tools: true code-fold: true fig-width: 8 fig-height: 4 code-block-bg: "#f1f3f5" code-block-border-left: "#31BAE9" mainfont: Source Sans Pro theme: journal toc: true toc-depth: 3 toc-location: left captions: true cap-location: margin table-captions: true tbl-cap-location: margin reference-location: margin comments: hypothesis: false execute: warning: false message: false --- # Overview VIOLET was a randomized clinical trial of seriously ill adults in ICUs to study the effectiveness of vitamin D vs. placebo. Vitamin D was not effective. VIOLET 2 is a rich dataset having daily ordinal outcomes assessed for 28 days. Due to an apparent anomaly in ventilator use on day 28 we restrict attention to days 1-27. The original [NEJM paper](https://www.nejm.org/doi/full/10.1056/NEJMoa1911124) focused on 1078 patients with confirmed baseline vitamin D deficiency. The analyses presented here use 1352 of the original 1360 randomized patients. We fit a partial proportional odds Markov transition model to form a simulation model. The original data may be requested [here](https://biolincc.nhlbi.nih.gov/studies/petal_violet). We simulate a 250,000 patient clinical trial using a Markov proportional odds ordinal longitudinal model fitted to the VIOLET study, including a limited number of covariates randomly sampled from the study and overriding the treatment effect on transition odds to have an odds ratio of 0.75 (when OR < 1 denotes benefit). The proportional odds assumption is relaxed for the time effect. The first-order Markov model is fitted with a partial proportional odds model using the `VGAM` package. Simulating a very large number of patients allows users to do repeated sampling as in simulations, and to analyze any ordinary sample size. Extensive re-analyses of VIOLET may be found at [hbiostat.org/proj/covid19](https://hbiostat.org/proj/covid19). ```{r setup} require(rms) # also gets access to Hmisc require(data.table) # used to compute various summary measures require(VGAM) knitrSet(lang='quarto') ``` # Data Setup and Baseline Descriptive Statistics ```{r violet2} dcf <- readRDS('dcf.rds') setDT(dcf, key=c('id', 'day')) # includes records with death carried forward dcf <- dcf[! is.na(bstatus) & day < 28, .(id, day, bstatus, status, treatment, ddeath, age, lips, charlson, sofa, vitD)] # Since baseline status did not distinguish ARDS from on ventilator, # we must pool follow-up status likewise s <- dcf[, bstatus] levels(s) <- list('Vent/ARDS'='ARDS', 'Vent/ARDS'='On Vent', 'In Hospital/Facility'='In Hospital') dcf[, table(bstatus, s)] dcf[, bstatus := s] d <- dcf d[, pstatus := shift(status), by=id] d[is.na(pstatus), pstatus := bstatus] # Don't carry death forward when fitting Markov models: d <- d[day <= ddeath] setkey(d, id, day) # Create a new treatment A B variable d[, tx := c(Placebo='A', 'Vitamin D'='B')[treatment]] d[, treatment := NULL] ``` ```{r desc,results='asis'} html(describe(d[day == 1, ], 'Baseline Data')) ``` # Fitting a State Transition Model to VIOLET Fit a partial PO model, with no PO assumption for the time effect. This allows different mixes of outcome categories over time. ```{r ppo} f <- vglm(ordered(status) ~ pstatus + lsp(day, 2) + age + sofa + tx, cumulative(reverse=TRUE, parallel=FALSE ~ lsp(day, 2)), data=d) se <- sqrt(diag(vcov(f))) round(cbind(beta=coef(f), se=se, z=coef(f) / se), 3) ``` The non-PO part of the model has 6 parameters: 2 time components $\times$ 3 states corresponding to PO model intercepts. Two time components were required to accommodate the very low number of patients who were sent home on the first day. The linear spline used here essentially grants an exception at $t=1$. # Simulation Model From the Study First we have to decide the covariate distributions to simulate from. Let's sample with replacement from the whole set of covariates. Fetch baseline records, making sure that the first record for each patient has `day=1`. Let's re-randomize patients since we are going to ignore the observed treatment effect anyway. Then sample with replacement for simulating a trial of `N` = 100,000 patients. When sampling baseline covariates we must also sample the baseline states. ```{r covars} N <- 250000 w <- d[, .(first=min(day)), by=id] w[, range(first)] set.seed(8) w <- d[day == 1, .(age, sofa, bstatus, tx=sample(0:1, .N, replace=TRUE))] # Expand to a large number of patients by sampling with replacement X <- w[, cbind(age, sofa, tx)] init <- w[, bstatus] i <- sample(1 : nrow(X), N, replace=TRUE) X <- X[i, ] init <- init[i] ``` Noting that he `vgam` function's non-PO terms are parameterized to pertain to each category after the first (Dead). We write a `g` function for the state transitions, and save intercepts separately. Then compose a function to simulate one transition for one patient. ```{r vpo} k <- coef(f) ints <- k[1 : 3] extra <- k[- (1 : 3)] extra <- extra[-length(extra)] # remove tx coefficient g <- function(yprev, t, gap, X, parameter=0, extra) { tau <- extra[1:2] # main intercepts kappa <- extra[3:8] # time effects including non-PO beta <- extra[9:10] # baseline covariates lp <- matrix(0., nrow=length(yprev), ncol=3, dimnames=list(as.character(yprev), c('Vent/ARDS', 'In Hospital/Facility', 'Home'))) tp <- pmax(t - 2, 0) # second time term, from linear spline knot at t=2 # 3 columns = no. distinct y less 1 = length of intercepts # lp[yp, ] is a 3-vector because the kappa components are split out into a 3-vector if(is.matrix(X)) stop('X not set up correctly') for(yp in yprev) lp[as.character(yp), ] <- tau[1] * (yp == 'In Hospital/Facility') + tau[2] * (yp == 'Home') + t * kappa[1:3] + tp * kappa[4:6] + beta[1] * X[1] + beta[2] * X[2] + parameter * X[3] lp } formals(g)$extra <- extra ``` # Simulate Data Simulate 28 consecutive days for each of the `N` new patients, using the covariate/treatment/initial state combinations we sampled with replacement to expand to `N`. But first simulate 10,000 patients with death carried forward and compare distributions to observed data. Note that the treatment OR needs to be $\frac{1}{0.75}$ since outcomes are reverse coded. ```{r sim} n <- 10000 s <- simMarkovOrd(n=n, y=levels(d$status), times=1:28, initial=init[1:n], X=X[1:n,], absorb='Dead', intercepts=ints, g=g, parameter=-log(0.75), carry=TRUE) propsPO(y ~ time, data=subset(s, tx == 1)) + labs(caption='Simulated active arm data') propsPO(y ~ time, data=subset(s, tx == 0)) + labs(caption='Simulated placebo data') propsPO(status ~ day, data=dcf) + labs(caption='Actual data') ``` Take a detour to examine the correlation patterns in these simulated data in comparison to patterns in the real data. Spearman's rank correlation is used on the 4-level outcome. ```{r corpat} #| h: 6 #| w: 7.5 #| layout-ncol: 2 #| column: screen # Make short and wide w <- dcast(d[, .(id, day, y=as.numeric(status))], id ~ day, value.var='y') r <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs', method='spearman') plotCorrM(r)[[1]] + labs(caption='Spearman correlations for observed data') + theme(legend.position='none') # Repeat for simulated data setDT(s, key=c('id', 'time')) w <- dcast(s[time < 28, .(id, time, y=as.numeric(y))], id ~ time, value.var='y') rs <- cor(as.matrix(w[, -1]), use='pairwise.complete.obs', method='spearman') plotCorrM(rs)[[1]] + labs(caption='Spearman correlations for simulated data') + theme(legend.position='none') saveRDS(list(r.actual=r, r.simulated=rs), 'vcorr.rds', compress='xz') ``` Compute mean absolute difference between correlation coefficients as a function of day ```{r madr} ad <- abs(r - rs) round(apply(ad, 1, mean), 2) ``` There is excellent agreement between actual and assumed correlation patterns except for correlations involving day 1. Simulate the large trial without carrying death forward. Compare coefficients fitted on the first 20,000 patients in the simulated sample to those from fitting the real data. ```{r realsim,results='asis'} s <- simMarkovOrd(n=N, y=levels(d$status), times=1:28, initial=init, X=X, absorb='Dead', intercepts=ints, g=g, parameter=-log(0.75)) s$parameter <- NULL storage.mode(s$id) <- storage.mode(s$time) <- storage.mode(s$gap) <- 'integer' simlongord <- s save(simlongord, file='simlongord.rda', compress='xz') html(contents(simlongord)) ``` ```{r comparefits} h <- vglm(ordered(y) ~ yprev + lsp(time, 2) + age + sofa + tx, cumulative(reverse=TRUE, parallel=FALSE ~ lsp(time, 2)), data=subset(s, id <= 20000)) round(cbind(Actual=coef(f), Simulated=coef(h)), 3) ``` The coefficients have good agreement except for the treatment effect, which we manipulated. The estimate is very close to $\log(\frac{1}{0.75})$. # Using the Simulated Data The simulated dataset is placed in [hbiostat.org/data](https://hbiostat.org/data) and can be easily downloaded using the `Hmisc` package `getHdata` function. Here's example code for downloading the data and setting up to analyze a simulated clinical trial of 1000 patients. The outcome, previous state, and time variables are named, respectively, `y`, `yprev`, `time`. ``` require(Hmisc) getHdata(simlongord) s <- subset(simlongord, id <= 500) ``` Save the first 500 patients permanently. ```{r save500} simlongord500 <- subset(simlongord, id <= 500) save(simlongord500, file='simlongord500.rda', compress='xz') # To access the subset: # getHdata(simlongord500) ``` # Computing Environment `r markupSpecs$html$session()`