Background

The purpose of the code below is to create an annotated ready-to-analyze R data table from the NIH NIAID ACTT-1 COVID-19 remdesivir randomized clinical trial. Data were provided by NAIAD and EMMES Corp. to the Vanderbilt Department of Biostatistics COVID-19 CONNECTS team under a data transfer agreement, and those outside the team must make a request to NIAID to receive the SAS file. Then they can run the code below to produce an R file.

The ACTT1 final report is here.

Much thanks to Tyler Bonnett for providing guidance about the use of the source dataset. From Tyler regarding 11 records with blanks for ADYC in the dataset.

There were 11 ACTT-1 participants who were randomized but missing a baseline ordinal score assessment (and all subsequent assessments – these patients left the study very early and were censored for analyses). They show up in the dataset because they were included in the primary ITT analysis which stratified by baseline disease severity (mild-moderate vs severe) rather than baseline ordinal score.

The 11 patients all had day 15 ordinal outcome level 7. Why? Response 2021-04-13 from Jennifer Ferreira:

If a subject is missing an ordinal score at the Day 15 Study Visit, then an ordinal score is imputed for Day 15. The following imputations were used:

Subject is randomized but has no ordinal score data available for any timepoint (pre- or post-randomization): ordinal score = 7

Subject dies prior to study day 15: ordinal score = 8

Subject is discharged from the hospital not due to death or early termination and does not return for the Day 15 visit: ordinal score = 2

Subject is terminated from the study while hospitalized (termination not due to death) prior to Day 15 assessment: ordinal score = last observed ordinal score

Subject is discharged from the hospital prior to Day 15 assessment and discontinues the study on the same day as discharge: ordinal score = last observed ordinal score

Subject is discharged and readmitted to the hospital prior to Day 15 assessment and was still hospitalized at study day 15: ordinal score = 7

The SAS dataset was updated on 2021-03-25 to include the as-randomized ITT codes (variable arm). The original as-treated variable is actarm. Tyler Bonnett provided the following notes on the new dataset:

A reviewer of the January 29, 2021 dataset asked about multiple ordinal score values on the same day for two subjects:

Subject COV.01501 had their Day 15 visit at Day 14. The Day 22 visit was missed but the site entered the Day 14 status under Day 22 except with unknown oxygen usage. The data submitted under Day 15 visit is likely the most accurate. The data entered under the Day 22 visit should be ignored for analysis purposes.

Subject COV.01198 had their Day 25 visit when re-admitted. The Day 22 visit was missed but the site entered the Day 25 status under Day 22 except with no known oxygen usage. The data submitted under Day 25 visit is likely the most accurate. The data entered under the Day 22 visit should be ignored for analysis purposes.

Both of the extra data points submitted under the Day 22 visit for COV.01501 and COV.01198 have been removed from the updated dataset for clarity.

On 2021-03-19, Jennifer Ferreira provided some answers to earlier data questions:

I see some records past day 30, going all the way up to day 40. What’s the scoop on those?

Some subjects had their Day 29 visit late.

I see many patients with no day 15 record but with the day 15 status (the separate variable OR15SCOR) present, and sometimes the day 15 status does not equal the ordinal outcome status ORDSCOR from the most recent day (e.g., day 13). Can you say something about that?

In some cases the data is imputed based on their hospitalization status. E.g., if a subject’s last score was at Day 3, discharged on Day 4 and missed the Day 15 visit, their score would be imputed as a 2 (not hospitalized with limitations/home oxygen use).

Mat Makowski provided additional information on 2021-04-07:

While in general the day 15 evaluation is at ADYC=14 there are subjects who had drug dosing delayed by a day. These subjects would have their day 15 evaluation at ADYC=15 since that is 14 days after drug receipt. This is also the number difference you see for ADYC=1. Subjects who had drug delayed would start at ADYC=2 and not have an observation for ADYC=1.

Jennifer Ferreira provided additional information on 2021-04-13 and 2021-04-14:

The Day 15 ordinal score is collected anywhere from Day 13 to Day 15. Collection at Day 13 frequently occurred among those that were hospitalized and discharged prior to the Day 15 visit but within window so they did not come back for a follow-up visit. The Day 15 visit would assess the subject the day prior; however, the day as calculated would be based on the assessment date, e.g., Day 14 in the longitudinal data.

The time to death variable D29DTHE0 is one day off of the follow-up day (ADYC) at which death is first noted. Would you mind saying something about that?

The time to death is a direct calculation while the study day is the day +1. For example, if a subject enrolled on March 13th but died on March 19th their time to death would be 6 days but it would be on Study Day 7.

There are two patients where the time to death is more than one day off of the first longitudinal record noting death: COV.01522 and COV.01144.

Death was not always entered through the ordinal score form; however, it was always entered as an AE in the data system. The AE data was used to impute scores but scores were only imputed for days where we anticipated having an ordinal score. For example, if a subject died on Day 18, the death would be imputed on Day 22 as that is all we needed for the planned analyses.

On 2021-04-02 Jennifer Ferreira responded to additional questions:

I see on patient COV.00712 that on the 7th record for the patient, starting with record 1 = baseline = study day 0 (pre-randomization), the patient died and this is on study day 6. But d29dthe0 is 5. The patient died 6 days from time zero so how do you have a 5 for d29dthe0? I must be missing something simple, or the dataset uses an unconventional time origin. Perhaps death is considered to occur at midnight at the start of each day? So a death on day 1 (first day of assigned treatment) is counted as not surviving through noon of day 1 so is rounded down to day 0?

If a subject died the same day as randomization then they would be included in the analysis as dying at time 0 days from randomization as a full day hadn’t elapsed from randomization to the time of death. For the example subject, COV.00712, the subject died on Study Day 6 (with randomization Day as Day 1) which was 5 days after randomization. Please see the table below to help explain.

The outcome of the proportional hazards model comparisons is the same with or without the one day added in. However, median estimates from the model may differ by one day. Similarly, the RMST estimates would increase as well.

I want to confirm that after death, all dead patients have y=8 carried forward in this dataset, to the last planned follow-up day.

Deaths were imputed for select time periods for summaries by day, Days 3, 5, 8, 11, 15, 22 and 29. They were not carried forward every day through Day 29.

Mat Makowski answered the following on 2021-04-05:

From the study design, data collection was highlighted for days 3, 5, 8, 11, 15, 22, 29. Would you mind confirming that the ordinal outcomes were by and large really collected independently on all the consecutive days until hospital discharge, i.e., were not routinely carried forward for days not in the list above?

Yes you are correct that outside of the visits listed data points were not routinely carried forward and those are the values reported by the site. I hope this helps.

The time variable (SAS variable ADYC; R variable day) represents the day at which the ordinal scale (worst value) was assessed for the previous 24h period. Here are the instructions on the CSO (Clinical Status and New Score) data collection form:

From Jennifer Ferreira 2021-04-13 and 2021-04-14:

The record that says “Baseline” for the day ADYC is at the time of randomization; day 1 value is the post-randomization value. The ADYC column includes the assessment day as based on the date highlighted in the eCRF below.

From Jennifer Ferreira 2021-04-16:

ADYC doesn’t equal the collection/report date, it is the assessment date. It is collected/reported on Day 1, this corresponds to ADYC=Baseline in the dataset, similarly collection/reported Day 2 would have an assessment day of day 1 (ADYC=1). ADYC=1 may or may not be pre-treatment. It is post randomization through midnight of the day of randomization, irrespective of tretment.

Ms Ferreira said this is what should be used:

Needed Quantity	Variable and/or ADYC to Pick
Last ordinal status at moment of or prior to randomization	`ORDSCOR` for `ADYC=Baseline` (identical to `BCSOSN`)
First ordinal status after randomization	`ORSCOR` for `ADYC=1`
Second ordinal status after randomization	`ORDSCOR` for `ADYC=2`
…	…

Ms Ferreira clarified more things on 2021-04-30:

Data collection was done daily while the patient was in the hospital, so the presence of records can almost be used to determine the day of discharge home

Subjects only had ordinal scores completed while hospitalized. The only visits for subjects after discharge are Day 15, 22 and 29 (days post treatment 14, 21 and 28). Day 15 would be the first day we get the non-hospitalized ordinal score for those that discharged prior to Day 15, similar situation for the other days. The data we provided does not have length of hospital stay included. As you noted, in most cases, the time to recovery is the time to discharge (assuming the starting point is randomization and not hospital admission). As you point out, there are subjects that are recovered with OS 3. For these cases, since an ordinal score is required each day, their discharge day may either be the day of their last OS score or the first day without an OS score. Subject COV.00701 has TTR=5 and has assessments on days 1 2 3 4 5 14 21 26 with status 5 4 4 4 4 1 1 1. Day 5 OS of 4 would mean that on Day 5 (4 days after randomization), the subject was hospitalized for at least part of the day needing ongoing medical care, in this particular case the subject was hospitalized all day as the time to recovery of 5 days would mean that the subject was discharged on Day 6. Ordinal Score is reported as the worst status on the day prior, so if a subject was hospitalized needing ongoing medical care at any time that day, their score would be at least a 4. The subject can still be discharged that day.

Recodes and R Data Table Creation

Some recoding was done:

comorbidities ($0, 1, \geq 2$) to 0, 1, 2
age “>89” set to numeric 90

require(Hmisc)

knitrSet(lang='markdown',w=7,h=6)

require(haven)

require(data.table)

require(ggplot2)
s <- read_sas('actt1_longitudinal_ordinal_data_updated.sas7bdat')
d <- upData(s, lowernames=TRUE, moveUnits=TRUE,
    rename=c(usubjid='id', adyc='day', agec='age', comorb2='co', arm='tx',
      or15scor='y15', ordscor='y', bcsosn='y0', bdursymp='dursx'),
    labels=c(age='Age',
             tx = 'Assigned Treatment',
             y0 = 'Baseline Severity',
             y  = 'Ordinal Outcome',
             co = 'Comborbidities',
             day= 'Day'),
    day = ifelse(day == 'Baseline', 0L, as.integer(day)),
    age = ifelse(age == '>89', 90, as.numeric(age)),
    co  = c('No Comorbidities'=0, '1 Comorbidity'=1,
            '2 or more Comorbidities'=2, 'Unknown'=NA)[co]
)

Input object size:   2146808 bytes;  15 variables    17210 observations
Label for d29dthe0 changed from Time to Death (days) to Time to Death 
    units set to days 
Label for d29dthe1 changed from Time to Death Censor Time Point (days) to Time to Death Censor Time Point 
    units set to days 
Label for ttrecov0 changed from Time to Recovery (days) to Time to Recovery 
    units set to days 
Label for ttrecov1 changed from Time to Recovery Censor Time Point (days) to Time to Recovery Censor Time Point 
    units set to days 
Label for agec changed from Age (years) to Age 
    units set to years 
Label for bdursymp changed from Duration of Symptoms (days) to Duration of Symptoms 
    units set to days 
Renamed variable     usubjid    to id 
Renamed variable     adyc   to day 
Renamed variable     agec   to age 
Renamed variable     comorb2    to co 
Renamed variable     arm    to tx 
Renamed variable     or15scor   to y15 
Renamed variable     ordscor    to y 
Renamed variable     bcsosn     to y0 
Renamed variable     bdursymp   to dursx

Modified variable   day

Modified variable   age
Modified variable   co
New object size:    1527040 bytes;  15 variables    17210 observations

setDT(d, key=c('id', 'day'))
d[, table(actarm, tx)]

                     tx
actarm                Placebo Remdesivir
  Not Treated               4         10
  Placebo                8130          0
  Remdesivir               29       8278
  Unplanned Treatment     730         29

d[day == 0, sum(is.na(y0 + y))]

[1] 0

d[day == 0, table(y0, y, useNA='ifany')]   # BCSOSN identical to ADYC=Baseline ORDSCOR

   y
y0    4   5   6   7
  4 138   0   0   0
  5   0 435   0   0
  6   0   0 193   0
  7   0   0   0 285

Look at data for those 11 patients who will later be deleted.

d[is.na(day), ]

           id         tx      actarm y0 d29dthe0 d29dthe1 ttrecov0 ttrecov1 day
 1: COV.00773 Remdesivir Not Treated NA       NA        0       NA        0  NA
 2: COV.00855 Remdesivir Not Treated NA       NA        0       NA        0  NA
 3: COV.01173    Placebo Not Treated NA       NA        0       NA        0  NA
 4: COV.01263 Remdesivir Not Treated NA       NA        0       NA        0  NA
 5: COV.01311 Remdesivir Not Treated NA       NA        0       NA        0  NA
 6: COV.01435    Placebo Not Treated NA       NA        0       NA        0  NA
 7: COV.01619 Remdesivir Not Treated NA       NA        1       NA        1  NA
 8: COV.01727    Placebo Not Treated NA       NA        0       NA        0  NA
 9: COV.01804 Remdesivir Not Treated NA       NA        1       NA        0  NA
10: COV.01806 Remdesivir Not Treated NA       NA        0       NA        0  NA
11: COV.01807 Remdesivir Not Treated NA       NA        1       NA        1  NA
     y y15 age sex dursx co
 1: NA   7  30   M    16 NA
 2: NA   7  41   F     6 NA
 3: NA   7  65   F     8  1
 4: NA   7  54   F     8 NA
 5: NA   7  49   F    12 NA
 6: NA   7  61   M    14 NA
 7: NA   7  74   F    14 NA
 8: NA   7  71   M     7 NA
 9: NA   7  65   M     9 NA
10: NA   7  51   M     1 NA
11: NA   7  76   M    12 NA

Check agreement between baseline and adyc=1 status and between adyc=1 status and adyc=2 status.

w <- d[day %in% 0:2, .(id, day, y)]
w <- dcast(w, id ~ paste0('y', day), value.var='y')
w[, table(y0, y1, useNA='ifany')]

   y1
y0    3   4   5   6   7 <NA>
  4   1 104  20   3   1    9
  5   0   8 331  33  22   41
  6   0   3  13 145  16   16
  7   0   0   3   3 245   34

w[, table(y1, y2, useNA='ifany')]

      y2
y1       1   3   4   5   6   7   8 <NA>
  3      0   1   0   0   0   0   0    0
  4      0   1  91  14   2   0   0    7
  5      2   1  26 308  16   6   2    6
  6      0   0   1  18 138  24   2    1
  7      0   0   0   1   5 273   4    1
  <NA>   0   0  10  35  13  36   1    5

Check missingness patterns for early days. There were 13 patients with no record for (originally numbered) days 2-5. These seem to be patients going home early with large gaps, as shown in the listing below. Note that days 1-5 refer to patient status on day 0-4.

days <- 1 : 5
w <- d[day %in% days, .(id, day, y)]
w <- dcast(w, id ~ day, value.var='y')
z <- rep('', nrow(w))
for(i in days) z <- paste0(z, 1 * is.na(w[[as.character(i)]]))
table(z)

z
00000 00001 00010 00011 00110 00111 01000 01010 01111 10000 10001 10010 10011 
  760    74     6    46     8    42     1     1    13    88     2     1     4

# combplotp(~ is.na(y1) + is.na(y2) + ..., data=w)
idm <- w$id[z == '01111']
k <- d[id %in% idm, .(id, day, y, ttrecov0, ttrecov1)]
k

           id day y ttrecov0 ttrecov1
 1: COV.00735   0 5        1       NA
 2: COV.00735   1 5        1       NA
 3: COV.00735  15 1        1       NA
 4: COV.00735  22 1        1       NA
 5: COV.00735  29 1        1       NA
 6: COV.00757   0 6       NA        0
 7: COV.00757   1 6       NA        0
 8: COV.00788   0 6        1       NA
 9: COV.00788   1 4        1       NA
10: COV.00788  15 2        1       NA
11: COV.00788  22 1        1       NA
12: COV.00788  28 1        1       NA
13: COV.00952   0 5        1       NA
14: COV.00952   1 5        1       NA
15: COV.00952  14 1        1       NA
16: COV.00952  20 1        1       NA
17: COV.00952  28 1        1       NA
18: COV.01003   0 5        1       NA
19: COV.01003   1 5        1       NA
20: COV.01003  14 1        1       NA
21: COV.01003  21 1        1       NA
22: COV.01003  25 1        1       NA
23: COV.01093   0 4        1       NA
24: COV.01093   1 4        1       NA
25: COV.01093  14 1        1       NA
26: COV.01093  21 1        1       NA
27: COV.01093  27 1        1       NA
28: COV.01123   0 4        1       NA
29: COV.01123   1 4        1       NA
30: COV.01123  15 2        1       NA
31: COV.01123  22 1        1       NA
32: COV.01123  29 1        1       NA
33: COV.01139   0 4        1       NA
34: COV.01139   1 4        1       NA
35: COV.01139  21 1        1       NA
36: COV.01139  33 2        1       NA
37: COV.01209   0 5        1       NA
38: COV.01209   1 5        1       NA
39: COV.01209  17 1        1       NA
40: COV.01209  22 1        1       NA
41: COV.01209  31 1        1       NA
42: COV.01309   0 5        1       NA
43: COV.01309   1 5        1       NA
44: COV.01309  16 1        1       NA
45: COV.01309  19 1        1       NA
46: COV.01309  30 1        1       NA
47: COV.01319   0 4        1       NA
48: COV.01319   1 4        1       NA
49: COV.01319  14 1        1       NA
50: COV.01319  19 1        1       NA
51: COV.01319  30 2        1       NA
52: COV.01572   0 7       NA        0
53: COV.01572   1 7       NA        0
54: COV.01666   0 4        1       NA
55: COV.01666   1 4        1       NA
56: COV.01666  12 1        1       NA
57: COV.01666  22 1        1       NA
58: COV.01666  28 1        1       NA
           id day y ttrecov0 ttrecov1

idm <- w$id[substring(z, 1, 1) == '1']
k <- d[id %in% idm, .(id, day, y, ttrecov0, ttrecov1)]
# View(k)

Drop the 11 subjects with missing day.

d <- d[! is.na(day), ]

# For each subject create a sequential record number
d[, seq := 0 : (.N - 1), by=id]
d[day < 2, table(day, seq, useNA='ifany')]

   seq
day    0    1
  0 1051    0
  1    0  951

# Check treatment counts vs Table 2 in paper
d[day == 0, table(tx)]  # doesn't match: get 518 (missing 3) 533 (missing 8)

tx
   Placebo Remdesivir 
       518        533

# The 11 missing are the ones described above

Compute day of death (NA if not died).

# Compute day of death
d[, ddeath := if(any(y == 8)) min(day[y == 8]) else NA_integer_, by=id]
# View(d[, .(id, seq, day, d29dthe0, ddeath, y, y15)])
d[, sum(day > ddeath, na.rm=TRUE)]

[1] 449

# Remove records where death was carried forward
dim(d)

[1] 17199    17

d <- d[day <= ddeath | is.na(ddeath), ]
dim(d)

[1] 16750    17

# Check death count against Table 2 in paper
d[seq == 0, table(tx, ! is.na(ddeath))]

            
tx           FALSE TRUE
  Placebo      441   77
  Remdesivir   473   60

d[, table(y == 8)]   # OK: 137 vs. 136 death by day 29


FALSE  TRUE 
16613   137

# Check distribution of y on day 15 vs paper
d[seq == 0,  table(y15, tx)]  # matches except missing 11 with y=7

   tx
y15 Placebo Remdesivir
  1     115        157
  2     102        117
  3       8         14
  4      33         38
  5      60         58
  6      24         28
  7     118         87
  8      58         34

d[day == 15, table(y,   tx)]  # only 1/2 of patients have day 15 records

   tx
y   Placebo Remdesivir
  1      27         37
  2      27         21
  3       8         10
  4      32         36
  5      45         45
  6      23         17
  7     109         79
  8       4          1

# Compare to time to death variable
d[, table(d29dthe0 - ddeath)]


  -5   -4   -1 
   6   23 1676

d[, sum(d29dthe0 - ddeath != -1, na.rm=TRUE)]

[1] 29

w <- d[d29dthe0 - ddeath != -1, .(id, day, y, d29dthe0)]    # 3 subjects, 29 records
print(w)

           id day y d29dthe0
 1: COV.01144   0 6       24
 2: COV.01144   1 6       24
 3: COV.01144   2 6       24
 4: COV.01144   3 7       24
 5: COV.01144   4 7       24
 6: COV.01144  29 8       24
 7: COV.01204   0 4       18
 8: COV.01204   1 5       18
 9: COV.01204   2 4       18
10: COV.01204   3 4       18
11: COV.01204   4 5       18
12: COV.01204   5 5       18
13: COV.01204  12 2       18
14: COV.01204  22 8       18
15: COV.01522   0 7       18
16: COV.01522   1 7       18
17: COV.01522   2 7       18
18: COV.01522   3 7       18
19: COV.01522   4 7       18
20: COV.01522   5 7       18
21: COV.01522   6 7       18
22: COV.01522   7 7       18
23: COV.01522   8 7       18
24: COV.01522   9 7       18
25: COV.01522  10 7       18
26: COV.01522  11 7       18
27: COV.01522  12 7       18
28: COV.01522  13 7       18
29: COV.01522  22 8       18
           id day y d29dthe0

Check that there is only one death record per subject and that when there is a death record it is always the last record. Compute last day in hospital.

# Tabulate number of death records per patient
d[, .(n8=sum(y == 8)), by=id][, table(n8)]

n8
  0   1 
914 137

d[y == 8, .(fromlast=max(day) - day), by=id][, table(fromlast)]

fromlast
  0 
137

# Compute last day known to be in the hospital at least part of that day
d[, dlasthosp := max(day[y >= 3 & y < 8]), by=id]

# Some patients un-recovered and had to be readmitted.  For those dlasthosp is the last day recorded in hospital.
# Compute dlasthospf which is the last day in the first hospital admission
# To compute this first compute dhome = first day with home status

d[, dhome      := min(day[y < 3]), by=id]   # NA if no at home records

d[, dlasthospf := max(day[y >= 3 & y < 8 & (is.na(dhome) | day < dhome)]), by=id]
z <- d[is.na(dlasthosp) | is.na(dlasthospf) | dlasthosp != dlasthospf, .(id, day, y, dhome, dlasthosp, dlasthospf)]
# View(z)
# Count number of patients who unrecovered
d[seq == 1, sum(dlasthosp != dlasthospf)]

[1] 28

Count the number of patients by treatment category, and check this against baseline records. Also check that every patient has only one treatment code.

d[, .(n=length(unique(id))), by=tx]

           tx   n
1:    Placebo 518
2: Remdesivir 533

d[day == 0, .N, by=tx]

           tx   N
1:    Placebo 518
2: Remdesivir 533

table(d[, .(n=length(unique(tx))), by=id]$n)


   1 
1051

Examine gaps in longitudinal measurements. Check that death is an absorbing state.

d[, gap := ifelse(day == 0, 0, day - shift(day)), by=id]
# Get distribution of gaps before and after going home the first time
w <- d[, .(athome = day > dlasthospf, gap)]
w[, table(gap, athome)]

    athome
gap  FALSE  TRUE
  0   1051     0
  1  13556   289
  2     99    58
  3      2    53
  4      2   109
  5      4   143
  6      2   245
  7      2   506
  8      2   229
  9      1   110
  10     0   109
  11     0    52
  12     0    54
  13     0    32
  14     0    20
  15     0     7
  16     0     2
  17     0     5
  19     0     1
  20     0     1
  21     0     1
  22     0     1
  25     0     1
  29     0     1

# ggplot(d, aes(x=gap)) + geom_histogram()
d[, ggplotlyr(ggfreqScatter(day, gap, by=y))]

w <- d[day > 0 & gap == 0, .(id, day, gap)]
w

Empty data.table (0 rows and 3 cols): id,day,gap

d[id %in% w$id, .(id, day)]

Empty data.table (0 rows and 2 cols): id,day

w <- d[y == 8, .(dgap = .N - (seq + 1))]
w[, table(dgap)]

dgap
105 107 109 110 111 112 113 114 116 117 118 119 120 121 122 123 124 125 126 127 
  1   2   2   3   1   3   3   2   6   4   2   4   4   4  12   6   6   7   3   3 
128 129 130 131 132 133 134 135 
  5  10  13   8   7   6   9   1

Check day 15 status vs. nearby daily status while in hospital. You can see that there are 15 patients with day 15 y=5 and the record on day 15 has y=4.

d[seq == 0, table(day <= dlasthosp)]


TRUE 
1051

# Get patients not dying or discharged to home before day 16
w <- d[(is.na(ddeath) | ddeath >= 16) & dlasthosp > 16, ]
w[seq == 0, .N]

[1] 365

w[day %in% 13:17, .N, by=day]

for(i in 13 : 17) {
  cat('\nDay', i, 'since randomization\n')
  print(w[day == i, table(y15, y)])
}


Day 13 since randomization
   y
y15   2   3   4   5   6   7
  3   0   7   2   0   0   0
  4   0   0  34   6   0   0
  5   1   1   1  48  13   9
  6   0   0   0   3  31   6
  7   0   0   0   0   4 185

Day 14 since randomization
   y
y15   1   2   3   4   5   6   7
  1   3   0   0   0   0   0   0
  2   0   3   0   0   0   0   0
  3   0   0   9   0   0   0   0
  4   0   0   0  42   0   0   0
  5   0   0   0   0  68   2   2
  6   0   0   0   0   1  39   0
  7   0   0   0   0   0   0 189

Day 15 since randomization
   y
y15   2   3   4   5   6   7
  2   2   0   0   1   0   0
  3   0   8   1   0   0   0
  4   0   1  40   1   0   0
  5   0   1   8  58   3   1
  6   0   1   1   9  25   3
  7   0   0   0   2   9 178

Day 16 since randomization
   y
y15   2   3   4   5   6   7
  2   1   0   0   1   0   0
  3   0   8   1   0   0   0
  4   0   0  36   4   0   0
  5   0   2  13  52   3   1
  6   0   1   1  14  21   2
  7   0   0   0  10  14 165

Day 17 since randomization
   y
y15   2   3   4   5   6   7
  2   1   0   1   0   0   0
  3   0   8   0   1   0   0
  4   0   2  36   2   0   0
  5   0   3  21  44   2   1
  6   0   0   2  16  19   2
  7   0   1   0  20  15 153

# View(w[,.(id, day, y, y15, gap)])

Get the frequency distribution of the day 15 status variable for those patients going home before day 15 and also compute the day of the patient’s last record.

w <- d[dlasthospf <= 15, .(y15=y15[1], lastday=max(day, na.rm=TRUE)), by=id]
w[, table(lastday, y15, useNA='ifany')]

       y15
lastday  1  2  3  4  5  6  7  8
     0   0  0  0  0  4  1  0  0
     1   0  0  0  0  0  1  1  0
     2   0  1  0  0  0  2  0  9
     3   0  1  0  0  1  2  1  7
     4   0  1  0  0  1  0  1  6
     5   0  0  0  0  0  1  0  6
     6   0  1  0  0  0  0  1 14
     7   0  0  0  1  1  0  0  9
     8   0  0  0  0  1  0  0  4
     9   0  0  0  0  0  0  0  4
     10  0  1  0  0  0  0  0  4
     11  0  0  0  0  0  0  1  7
     12  0  0  0  0  0  0  0  4
     13  2  0  0  0  0  0  0  6
     14  0  0  1  1  0  0  0 11
     15  1  0  0  2  0  1  5  1
     16  0  1  0  1  1  0  1  0
     18  0  1  0  0  0  0  0  0
     19  1  0  0  0  0  0  0  0
     21  2  1  0  0  0  0  0  0
     22  0  1  0  0  0  0  1  0
     23  1  0  0  0  0  0  0  0
     24  1  0  0  0  0  0  1  0
     25 11  9  1  2  0  1  0  0
     26 13 16  1  1  2  0  0  0
     27 31 28  0  3  8  0  0  0
     28 94 68  3 13 10  1  0  0
     29 62 37  0  3  5  0  1  0
     30 30 27  1  2  1  0  0  0
     31 16 15  2  1  2  1  0  0
     32  4  6  0  0  0  0  0  0
     33  3  4  0  0  0  0  0  0

Check that alive patients almost always have daily ordinal assessments while in hospital (y > 2).

d[day > 0 & day <= dlasthosp & (is.na(ddeath) | day < ddeath), table(gap)]

gap
    1     2     3     4     5     6     7     8     9    10    11 
13691   102     2    12     8    10    11     7     3     1     3

Check censoring day against last day with an ordinal assessment. Last contact day for vital status censoring purposes is d29dth1 + 1.

d[, dcens := pmax(d29dthe0, d29dthe1, ttrecov0 + 1, ttrecov1 + 1, na.rm=TRUE)]
# View(d[, .(id, day, y, ddeath, dcensd)])
z <- d[day <= 28, .(maxday = max(day), dcens=dcens[1]), by=.(id, nad=is.na(d29dthe0))]
z[, table(nad, maxday - dcens)]

       
nad     -27 -26 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10
  FALSE   9   7   7   6  14   9   4   4   4   7   4   6  10   5   3   5   2   4
  TRUE    0   0   0   1   0   0   0   0   0   0   0   0   0   2   4   3   3   7
       
nad      -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   4
  FALSE   5   0   2   4   3   0   4   2   2   0   3   1
  TRUE   10   9  72  85  62  28  34  15  54 524   2   0

# 4 pts with maxday > day of death
# Take daily assessments to current max day or dcens whichever is greater

For days between last ordinal assessment in the first hospital admission and death (if any), we know the patient is alive and at home. Add daily records to indicate this, with y=1.5 to indicate “at home, activity limits and O2 status unknown”. Stop at day 29. Note that when a patient is kept in the hospital for infection control (y=3) they still get daily outcome assessments. If the patient has a missing y at day=1 set y to the baseline state.

# First create a one row per patient data table with information that doesn't change over days
u <- d[seq == 0, .(id, tx, ddeath, dhome, dlasthosp, dlasthospf, d29dthe0, d29dthe1, actarm, y0, ttrecov0, ttrecov1, y15,
                   age, sex, dursx, co, dcens)]
setkey(u, 'id')
# Create a dataset with everyone having 28d of data
w <- expand.grid(id=unique(d$id), day=1:28)
setDT(w, key=c('id', 'day'))
# Next line is same as w <- merge(w, d[, .(id, day, y)], by=c('id', 'day'), all.x=TRUE)
w <- d[w, .(id, day, y), nomatch=NA]
w <- u[w, nomatch=NA]
w[is.na(y) & day == 1, y := y0]
w[, y := as.numeric(y)]
w[is.na(y) & day > dlasthospf & day <= dcens & (is.na(ddeath) | day < ddeath), y := 1.5]
w <- w[! is.na(y), ]
# Look at 10 randomly chosen patients
ids <- sort(sample(w$id, 10))
if(FALSE) for(i in ids) {
  xless(d[i, .(day, y, dhome, ddeath, dlasthosp, dlasthospf, dcens)], title=i)
  xless(w[i, .(day, y, dhome, ddeath, dlasthosp, dlasthospf, dcens)], title=i)
  readLines(n=1)
  system('pkill xless')
}
setkey(w, id, day)
d <- w

Add the previous state to each record, and describe the new analysis file.

d[, yprev := ifelse(day == 1, y0, shift(y)), by=id]
d[, gap   := ifelse(day == 1, 1, day - shift(day)), by=id]
# Remove baseline records
d <- d[day > 0, ]
# Add sequential record number
d[, seq := (1 : .N), by=id]
d <- upData(d,
            labels=c(yprev     = 'Previous Outcome Status',
                     day       = 'Day',
                     seq       = 'Sequential Record Number',
                     ddeath    = 'Day of First Record With Status = Death',
                     dhome     = 'First day with status at home (NA if never)',
                     dlasthosp = 'Last day in hospital with outcome assessment',
                     dlasthospf= 'Last day in first hospital admission with outcome assessment',
                     dcens     = 'Day of censoring for TTR/death',
                     gap       = 'Days Since Previous Record',
                     y0        = 'Baseline Status',
                     y         = 'Outcome Status'))

Input object size:   3326680 bytes;  23 variables    26167 observations
New object size:    3122288 bytes;  23 variables    26167 observations

html(describe(d))

d

23 Variables 26167 Observations

n	missing	distinct
26167	0	1051

lowest :	COV.00701	COV.00702	COV.00703	COV.00704	COV.00705
highest:	COV.01802	COV.01803	COV.01805	COV.01808	COV.01809

tx: Assigned Treatment Format:$

n	missing	distinct
26167	0	2

 Value         Placebo Remdesivir
 Frequency       12743      13424
 Proportion      0.487      0.513

ddeath: Day of First Record With Status = Death

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1674	24493	27	0.997	16.93	8.534	5	6	11	17	23	27	29

lowest : 2 3 4 5 6 , highest: 25 26 27 29 31

dhome: First day with status at home (NA if never)

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
19070	7097	32	0.974	17.25	5.53	12	13	14	15	21	27	28

lowest : 2 4 5 6 7 , highest: 30 31 32 33 40

dlasthosp: Last day in hospital with outcome assessment

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26167	0	33	0.997	14.29	10.73	2	3	6	12	24	28	29

lowest : 0 1 2 3 4 , highest: 28 29 30 31 32

dlasthospf: Last day in first hospital admission with outcome assessment

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26167	0	33	0.997	13.86	10.56	2	3	5	11	22	28	29

lowest : 0 1 2 3 4 , highest: 28 29 30 31 32

d29dthe0: Time to Death days

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1646	24521	26	0.997	15.55	8.174	4	5	10	15	22	25	26

lowest : 1 2 3 4 5 , highest: 23 24 25 26 28

d29dthe1: Time to Death Censor Time Point days

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
24521	1646	23	0.504	27.45	0.9683	25	26	28	28	28	28	28

lowest : 0 1 2 3 4 , highest: 24 25 26 27 28

actarm: Treatment

n	missing	distinct
26167	0	4

 Value              Not Treated             Placebo          Remdesivir
 Frequency                    4               12017               13421
 Proportion               0.000               0.459               0.513
                               
 Value      Unplanned Treatment
 Frequency                  725
 Proportion               0.028

y0: Baseline Status

n	missing	distinct	Info	Mean	Gmd
26167	0	4	0.898	5.56	1.12

 Value          4     5     6     7
 Frequency   3651 11094  4531  6891
 Proportion 0.140 0.424 0.173 0.263

ttrecov0: Time to Recovery days

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
20486	5681	33	0.997	10.16	7.918	2	3	4	8	14	21	25

lowest : 0 1 2 3 4 , highest: 28 29 30 31 32

ttrecov1: Time to Recovery Censor Time Point days

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
5681	20486	18	0.859	27.1	1.928	25	27	27	27	28	28	30

lowest : 0 1 2 3 4 , highest: 25 26 27 28 30

 Value          0     1     2     3     4     5     6     7    10    13    14    15
 Frequency      7     6    12    10     6     6    15     8    12    14    29    17
 Proportion 0.001 0.001 0.002 0.002 0.001 0.001 0.003 0.001 0.002 0.002 0.005 0.003
                                               
 Value         24    25    26    27    28    30
 Frequency     50   156   135  2464  2203   531
 Proportion 0.009 0.027 0.024 0.434 0.388 0.093

y15: Ordinal Scale Score Day 15

n	missing	distinct	Info	Mean	Gmd
26167	0	8	0.955	3.597	2.693

lowest : 1 2 3 4 5 , highest: 4 5 6 7 8

 Value          1     2     3     4     5     6     7     8
 Frequency   7479  5934   586  1862  3033  1231  5322   720
 Proportion 0.286 0.227 0.022 0.071 0.116 0.047 0.203 0.028

age: Age

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26167	0	68	1	57.98	16.62	32	38	48	59	68	77	82

lowest : 21 22 23 25 26 , highest: 86 87 88 89 90

sex: Sex

n	missing	distinct
26167	0	2

 Value          F     M
 Frequency   9309 16858
 Proportion 0.356 0.644

dursx: Duration of Symptoms days

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26083	84	35	0.995	9.865	5.669	3	4	6	9	12	17	20

lowest : 0 1 2 3 4 , highest: 30 31 33 34 46

co: Comborbidities

n	missing	distinct	Info	Mean	Gmd
26082	85	3	0.816	1.348	0.8082

 Value          0     1     2
 Frequency   5035  6931 14116
 Proportion 0.193 0.266 0.541

dcens: Day of censoring for TTR/death days

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26167	0	29	0.68	27.65	1.254	25	26	28	28	28	29	29

lowest : 1 2 3 4 6 , highest: 29 30 31 32 33

day: Day

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26167	0	28	0.999	13.88	9.243	2	3	7	14	21	25	27

lowest : 1 2 3 4 5 , highest: 24 25 26 27 28

y: Outcome Status

n	missing	distinct	Info	Mean	Gmd
26167	0	9	0.916	3.773	2.545

lowest : 1.0 1.5 2.0 3.0 4.0 , highest: 4.0 5.0 6.0 7.0 8.0

 Value        1.0   1.5   2.0   3.0   4.0   5.0   6.0   7.0   8.0
 Frequency    930 10765   536   348  2107  3785  1813  5750   133
 Proportion 0.036 0.411 0.020 0.013 0.081 0.145 0.069 0.220 0.005

yprev: Previous Outcome Status

n	missing	distinct	Info	Mean	Gmd
26167	0	8	0.92	3.875	2.529

lowest : 1.0 1.5 2.0 3.0 4.0 , highest: 3.0 4.0 5.0 6.0 7.0

 Value        1.0   1.5   2.0   3.0   4.0   5.0   6.0   7.0
 Frequency    664 10439   415   336  2207  4170  1987  5949
 Proportion 0.025 0.399 0.016 0.013 0.084 0.159 0.076 0.227

gap: Days Since Previous Record

n	missing	distinct	Info	Mean	Gmd
26167	0	8	0.002	1.002	0.003591

lowest : 1 2 3 4 5 , highest: 4 5 6 7 8

 Value          1     2     3     4     5     6     7     8
 Frequency  26153     3     2     2     4     1     1     1
 Proportion 0.999 0.000 0.000 0.000 0.000 0.000 0.000 0.000

seq: Sequential Record Number

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
26167	0	28	0.999	13.86	9.231	2	3	7	14	21	25	27

lowest : 1 2 3 4 5 , highest: 24 25 26 27 28

# Visually check key columns
# View(d[, .(id, seq, day, y0, yprev, y)])

saveRDS(d, 'actt1.rds', compress='xz')

Check for Redundancy in Daily Records

Compute all consecutive day transition proportions to see if for some days the ordinal scale is always the same as the previous day. That would indicate carry forward instead of real daily assessment.

ordlabels <- c('not hosp, no limits', 'not hosp', 'not hosp, limits or O2',
  'hosp, no O2, no care', 'hosp, no O2, care',
  'hosp, O2', 'hosp, noninvasive vent', 'hosp, invasive vent', 'death')
require(plotly)

g <- propsTrans(y ~ day + id, data=d, maxsize=2, labels=ordlabels) +
  theme(legend.position='bottom')
ggplotlyr(g)   # ggplotlyr in Hmisc to fix tooltip hovertext

Usage Notes

To read the R file do:

require(Hmisc)
require(data.table)
d <- readRDS('actt1.rds')
contents(d)    # show all the metadata

The records stop at day 28 of follow-up.

The status variables y, yprev, y0, y15 are coded 1, 1.5, 2, 3, …, 8. 1.5 stands for the patient being at home but it is not known whether y=1 or 2 because a daily outcome assessment was not made on that dasy. To turn these variables into factors with value labels do the following:

ordlabels <- c('not hosp, no limits', 'not hosp', 'not hosp, limits or O2',
  'hosp, no O2, no care', 'hosp, no O2, care',
  'hosp, O2', 'hosp, noninvasive vent', 'hosp, invasive vent', 'death')

g <- function(x) factor(x, c(1, 1.5, 2:8), ordlabels)[,drop=TRUE]
d[, y     := g(y)]
d[, y15   := g(y15)]
d[, y0    := g(y0)]
d[, yprev := g(yprev)]
# To do all at once:
d[, c('y', 'y15', 'y0', 'yprev') := list(g(y), g(y15), g(y0), g(yprev))]

The drop=TRUE will force y0 or yprev to only take on the 4 observed levels. Note that with data.table, := means “define a new variable and store it in-place in the existing data table”.

Note that y15 is the NIAID-provided day 15 status variable, which doesn’t always match up with the outcome status in the day 15 record.

This dataset excludes 11 patients for whom baseline severity wasn’t collected and who immediately dropped out of the study.

Since this data table does not carry death forward, you need to do the following to empirically look at state occupancy proportions which require death to be carried forward. Here we carry death forward to day 28.

nrow(d)   # rows before carrying forward

[1] 26167

w <- d[, .(dlast=day, last8 = day == max(day) & y == 8, tx=tx), by=id]
w <- w[last8 == TRUE, .(day = (dlast + 1) : 28,
                        y   = rep(8, 28 - dlast), tx=tx),
             by=id]
label(w$day) <- label(d$day)
label(w$y)   <- label(d$y)
w <- rbind(d[, .(id, day, y, tx)], w)
setkey(w, id, day)
nrow(w)   # rows after carrying forward

[1] 28326

The following also works and is a bit more elegant. For more data.table coding options see this.

u <- d[, .SD[(day == max(day) & y == 8),], by=id,
         .SDcols=c('day', 'tx', 'y')]
u <- u[, .(day = (day + 1) : 28, y=rep(8, 28 - day), tx=tx), by=id]
label(u$day) <- label(d$day)
label(u$y)   <- label(d$y)
u <- rbind(d[, .(id, day, y, tx)], u)
setkey(u, id, day)
identical(w, u)

[1] TRUE

As an alternative, the following fills in missing days before death with y=1.5 (home, unspecified) and carries death forward to day 28. This is useful for descriptive state occupancy statistics. For patients with missing day 1 records assume y is the baseline state. Don’t fill-in missing y for non-deaths with 1.5 at day last known to be in hospital (dlasthosp). But don’t use this method since the 1.5s have already been filled in in the master analysis file.

# First create a one row per patient data table with information that doesn't change over days
u <- d[seq == 1, .(id, tx, y0, ddeath, dlasthosp, ttrecov0, ttrecov1)]
setkey(u, 'id')
# Create a dataset with everyone having 28d of data
w <- expand.grid(id=unique(d$id), day=1:28)
setDT(w, key=c('id', 'day'))
# Next line is same as w <- merge(w, d[, .(id, day, y)], by=c('id', 'day'), all.x=TRUE)
w <- d[w, .(id, day, y), nomatch=NA]
w <- u[w, nomatch=NA]
w[, y := ifelse(! is.na(y), y,
                ifelse(day == 1, y0,
                 ifelse(is.na(ddeath), 1.5, 8 * (day >= ddeath))))]
ids <- c('COV.00701', 'COV.01793', 'COV.01006', 'COV.01144')
d[id %in% ids, .(id, day, y, tx)]
# xless(w[id %in% ids, ], topn=400)
label(w$day) <- label(d$day)
label(w$y)   <- label(d$y)
dcf <- w

Reconstruct Time to Recovery

Compute for each patient the number of days until y < 4. Death counts as non-recovery at day 28 per ACTT-1.

u <- d[, .(ttr = if(any(y == 8) || ! any(y < 4)) 28L else min(day[y < 4]),
           ttro= ifelse(is.na(ttrecov0), ttrecov1, ttrecov0)[1],
           ttrecov0=ttrecov0[1], ttrecov1=ttrecov1[1],
                     y0=y0[1]),
             by=.(id, tx)]
u[, .(Mean=mean(ttr), Median=median(ttr)), by=tx]

           tx     Mean Median
1:    Placebo 17.27992     17
2: Remdesivir 14.80488     12

u[, .(Mean=mean(ttr), Median=median(ttr)), keyby=.(y0, tx)]

   y0         tx     Mean Median
1:  4    Placebo 10.12698      7
2:  4 Remdesivir  8.16000      6
3:  5    Placebo 14.06897     10
4:  5 Remdesivir 11.05172      8
5:  6    Placebo 18.67347     20
6:  6 Remdesivir 17.72632     17
7:  7    Placebo 23.55195     28
8:  7 Remdesivir 23.13740     28

ggplot(u, aes(x=ttr)) + geom_histogram() + facet_grid(y0 ~ tx) +
  xlab('Time to Recovery Using Daily States') + ylab('')

u[, .(Mean=mean(ttro), Median=median(ttro)), by=tx]

           tx     Mean Median
1:    Placebo 15.91699   14.5
2: Remdesivir 13.28143    9.0

u[, .(Mean=mean(ttro), Median=median(ttro)), keyby=.(y0, tx)]

   y0         tx      Mean Median
1:  4    Placebo  8.523810    6.0
2:  4 Remdesivir  6.680000    5.0
3:  5    Placebo 12.379310    9.0
4:  5 Remdesivir  9.469828    7.0
5:  6    Placebo 18.142857   19.5
6:  6 Remdesivir 15.863158   12.0
7:  7    Placebo 22.188312   26.0
8:  7 Remdesivir 21.938931   26.0

u[, ggfreqScatter(ttrecov0, ttr)]
u[, ggfreqScatter(ttrecov1, ttr)]
u[, ggfreqScatter(ttro, ttr)]
i <- u[, .(id=id, susp=! is.na(ttrecov0) & abs(ttr - ttrecov0) > 2)]
# i <- u[, .(id=id, susp=abs(ttr - ttro) > 2)]
table(i$susp)


FALSE  TRUE 
 1030    21

# Get raw data for those with large differences
ttri        <- u$ttr
names(ttri) <- u$id
idsusp <- i[susp == TRUE, id]
z <- d[id %in% idsusp, .(id, day, y, ttr=ttri[id], ttrecov0, ttrecov1, ddeath, dlasthosp)]
# View(z)

Computing Environment

 R version 4.1.0 (2021-05-18)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Pop!_OS 20.10
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] plotly_4.9.3      data.table_1.13.0 haven_2.3.1       Hmisc_4.6-0      
 [5] ggplot2_3.3.2     Formula_1.2-3     survival_3.2-11   lattice_0.20-44

To cite R in publications use:

R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2021). Hmisc: Harrell Miscellaneous. R package version 4.6-0, https://hbiostat.org/R/Hmisc/.

Creation of ACTT1 Analysis File

Frank Harrell
CONNECTS VUMC
Study Design and Analysis Core 5
Department of Biostatistics
Vanderbilt University School of Medicine
VICTR Research Methods Program

2021-05-21

Background

Recodes and R Data Table Creation

Check for Redundancy in Daily Records

Usage Notes

Reconstruct Time to Recovery

Computing Environment

Creation of ACTT1 Analysis File

Frank Harrell CONNECTS VUMC Study Design and Analysis Core 5 Department of Biostatistics Vanderbilt University School of Medicine VICTR Research Methods Program

2021-05-21

Background

Recodes and R Data Table Creation

Check for Redundancy in Daily Records

Usage Notes

Reconstruct Time to Recovery

Computing Environment

Frank Harrell
CONNECTS VUMC
Study Design and Analysis Core 5
Department of Biostatistics
Vanderbilt University School of Medicine
VICTR Research Methods Program