The purpose of the code below is to create an annotated ready-to-analyze R data table from the NIH NIAID ACTT-1 COVID-19 remdesivir randomized clinical trial. Data were provided by NAIAD and EMMES Corp. to the Vanderbilt Department of Biostatistics COVID-19 CONNECTS team under a data transfer agreement, and those outside the team must make a request to NIAID to receive the SAS file. Then they can run the code below to produce an R file.
The ACTT1 final report is here.
Much thanks to Tyler Bonnett for providing guidance about the use of the source dataset. From Tyler regarding 11 records with blanks for ADYC
in the dataset.
There were 11 ACTT-1 participants who were randomized but missing a baseline ordinal score assessment (and all subsequent assessments – these patients left the study very early and were censored for analyses). They show up in the dataset because they were included in the primary ITT analysis which stratified by baseline disease severity (mild-moderate vs severe) rather than baseline ordinal score.
The 11 patients all had day 15 ordinal outcome level 7. Why? Response 2021-04-13 from Jennifer Ferreira:
If a subject is missing an ordinal score at the Day 15 Study Visit, then an ordinal score is imputed for Day 15. The following imputations were used:
- Subject is randomized but has no ordinal score data available for any timepoint (pre- or post-randomization): ordinal score = 7
- Subject dies prior to study day 15: ordinal score = 8
- Subject is discharged from the hospital not due to death or early termination and does not return for the Day 15 visit: ordinal score = 2
- Subject is terminated from the study while hospitalized (termination not due to death) prior to Day 15 assessment: ordinal score = last observed ordinal score
- Subject is discharged from the hospital prior to Day 15 assessment and discontinues the study on the same day as discharge: ordinal score = last observed ordinal score
- Subject is discharged and readmitted to the hospital prior to Day 15 assessment and was still hospitalized at study day 15: ordinal score = 7
The SAS dataset was updated on 2021-03-25 to include the as-randomized ITT codes (variable arm
). The original as-treated variable is actarm
. Tyler Bonnett provided the following notes on the new dataset:
A reviewer of the January 29, 2021 dataset asked about multiple ordinal score values on the same day for two subjects:
Subject COV.01501 had their Day 15 visit at Day 14. The Day 22 visit was missed but the site entered the Day 14 status under Day 22 except with unknown oxygen usage. The data submitted under Day 15 visit is likely the most accurate. The data entered under the Day 22 visit should be ignored for analysis purposes.
Subject COV.01198 had their Day 25 visit when re-admitted. The Day 22 visit was missed but the site entered the Day 25 status under Day 22 except with no known oxygen usage. The data submitted under Day 25 visit is likely the most accurate. The data entered under the Day 22 visit should be ignored for analysis purposes.
Both of the extra data points submitted under the Day 22 visit for COV.01501 and COV.01198 have been removed from the updated dataset for clarity.
On 2021-03-19, Jennifer Ferreira provided some answers to earlier data questions:
I see some records past day 30, going all the way up to day 40. What’s the scoop on those?
Some subjects had their Day 29 visit late.
I see many patients with no day 15 record but with the day 15 status (the separate variable OR15SCOR) present, and sometimes the day 15 status does not equal the ordinal outcome status ORDSCOR from the most recent day (e.g., day 13). Can you say something about that?
In some cases the data is imputed based on their hospitalization status. E.g., if a subject’s last score was at Day 3, discharged on Day 4 and missed the Day 15 visit, their score would be imputed as a 2 (not hospitalized with limitations/home oxygen use).
Mat Makowski provided additional information on 2021-04-07:
While in general the day 15 evaluation is at
ADYC=14
there are subjects who had drug dosing delayed by a day. These subjects would have their day 15 evaluation atADYC=15
since that is 14 days after drug receipt. This is also the number difference you see forADYC=1
. Subjects who had drug delayed would start atADYC=2
and not have an observation forADYC=1
.
Jennifer Ferreira provided additional information on 2021-04-13 and 2021-04-14:
The Day 15 ordinal score is collected anywhere from Day 13 to Day 15. Collection at Day 13 frequently occurred among those that were hospitalized and discharged prior to the Day 15 visit but within window so they did not come back for a follow-up visit. The Day 15 visit would assess the subject the day prior; however, the day as calculated would be based on the assessment date, e.g., Day 14 in the longitudinal data.
The time to death variable D29DTHE0 is one day off of the follow-up day (ADYC) at which death is first noted. Would you mind saying something about that?
The time to death is a direct calculation while the study day is the day +1. For example, if a subject enrolled on March 13th but died on March 19th their time to death would be 6 days but it would be on Study Day 7.
There are two patients where the time to death is more than one day off of the first longitudinal record noting death: COV.01522 and COV.01144.
Death was not always entered through the ordinal score form; however, it was always entered as an AE in the data system. The AE data was used to impute scores but scores were only imputed for days where we anticipated having an ordinal score. For example, if a subject died on Day 18, the death would be imputed on Day 22 as that is all we needed for the planned analyses.
On 2021-04-02 Jennifer Ferreira responded to additional questions:
I see on patient COV.00712 that on the 7th record for the patient, starting with record 1 = baseline = study day 0 (pre-randomization), the patient died and this is on study day 6. But d29dthe0 is 5. The patient died 6 days from time zero so how do you have a 5 for d29dthe0? I must be missing something simple, or the dataset uses an unconventional time origin. Perhaps death is considered to occur at midnight at the start of each day? So a death on day 1 (first day of assigned treatment) is counted as not surviving through noon of day 1 so is rounded down to day 0?
If a subject died the same day as randomization then they would be included in the analysis as dying at time 0 days from randomization as a full day hadn’t elapsed from randomization to the time of death. For the example subject, COV.00712, the subject died on Study Day 6 (with randomization Day as Day 1) which was 5 days after randomization. Please see the table below to help explain.
The outcome of the proportional hazards model comparisons is the same with or without the one day added in. However, median estimates from the model may differ by one day. Similarly, the RMST estimates would increase as well.
I want to confirm that after death, all dead patients have y=8 carried forward in this dataset, to the last planned follow-up day.
Deaths were imputed for select time periods for summaries by day, Days 3, 5, 8, 11, 15, 22 and 29. They were not carried forward every day through Day 29.
Mat Makowski answered the following on 2021-04-05:
From the study design, data collection was highlighted for days 3, 5, 8, 11, 15, 22, 29. Would you mind confirming that the ordinal outcomes were by and large really collected independently on all the consecutive days until hospital discharge, i.e., were not routinely carried forward for days not in the list above?
Yes you are correct that outside of the visits listed data points were not routinely carried forward and those are the values reported by the site. I hope this helps.
The time variable (SAS variable ADYC
; R variable day
) represents the day at which the ordinal scale (worst value) was assessed for the previous 24h period. Here are the instructions on the CSO (Clinical Status and New Score) data collection form:
From Jennifer Ferreira 2021-04-13 and 2021-04-14:
The record that says “Baseline” for the day
ADYC
is at the time of randomization; day 1 value is the post-randomization value. The ADYC column includes the assessment day as based on the date highlighted in the eCRF below.
From Jennifer Ferreira 2021-04-16:
ADYC
doesn’t equal the collection/report date, it is the assessment date. It is collected/reported on Day 1, this corresponds toADYC=Baseline
in the dataset, similarly collection/reported Day 2 would have an assessment day of day 1 (ADYC=1
).ADYC=1
may or may not be pre-treatment. It is post randomization through midnight of the day of randomization, irrespective of tretment.
Ms Ferreira said this is what should be used:
Needed Quantity | Variable and/or ADYC to Pick |
---|---|
Last ordinal status at moment of or prior to randomization | ORDSCOR for ADYC=Baseline (identical to BCSOSN ) |
First ordinal status after randomization | ORSCOR for ADYC=1 |
Second ordinal status after randomization | ORDSCOR for ADYC=2 |
… | … |
Ms Ferreira clarified more things on 2021-04-30:
Data collection was done daily while the patient was in the hospital, so the presence of records can almost be used to determine the day of discharge home
Subjects only had ordinal scores completed while hospitalized. The only visits for subjects after discharge are Day 15, 22 and 29 (days post treatment 14, 21 and 28). Day 15 would be the first day we get the non-hospitalized ordinal score for those that discharged prior to Day 15, similar situation for the other days. The data we provided does not have length of hospital stay included. As you noted, in most cases, the time to recovery is the time to discharge (assuming the starting point is randomization and not hospital admission). As you point out, there are subjects that are recovered with OS 3. For these cases, since an ordinal score is required each day, their discharge day may either be the day of their last OS score or the first day without an OS score. Subject COV.00701 has TTR=5 and has assessments on days 1 2 3 4 5 14 21 26 with status 5 4 4 4 4 1 1 1. Day 5 OS of 4 would mean that on Day 5 (4 days after randomization), the subject was hospitalized for at least part of the day needing ongoing medical care, in this particular case the subject was hospitalized all day as the time to recovery of 5 days would mean that the subject was discharged on Day 6. Ordinal Score is reported as the worst status on the day prior, so if a subject was hospitalized needing ongoing medical care at any time that day, their score would be at least a 4. The subject can still be discharged that day.
Some recoding was done:
require(Hmisc)
knitrSet(lang='markdown',w=7,h=6)
require(haven)
require(data.table)
require(ggplot2)
s <- read_sas('actt1_longitudinal_ordinal_data_updated.sas7bdat')
d <- upData(s, lowernames=TRUE, moveUnits=TRUE,
rename=c(usubjid='id', adyc='day', agec='age', comorb2='co', arm='tx',
or15scor='y15', ordscor='y', bcsosn='y0', bdursymp='dursx'),
labels=c(age='Age',
tx = 'Assigned Treatment',
y0 = 'Baseline Severity',
y = 'Ordinal Outcome',
co = 'Comborbidities',
day= 'Day'),
day = ifelse(day == 'Baseline', 0L, as.integer(day)),
age = ifelse(age == '>89', 90, as.numeric(age)),
co = c('No Comorbidities'=0, '1 Comorbidity'=1,
'2 or more Comorbidities'=2, 'Unknown'=NA)[co]
)
Input object size: 2146808 bytes; 15 variables 17210 observations
Label for d29dthe0 changed from Time to Death (days) to Time to Death
units set to days
Label for d29dthe1 changed from Time to Death Censor Time Point (days) to Time to Death Censor Time Point
units set to days
Label for ttrecov0 changed from Time to Recovery (days) to Time to Recovery
units set to days
Label for ttrecov1 changed from Time to Recovery Censor Time Point (days) to Time to Recovery Censor Time Point
units set to days
Label for agec changed from Age (years) to Age
units set to years
Label for bdursymp changed from Duration of Symptoms (days) to Duration of Symptoms
units set to days
Renamed variable usubjid to id
Renamed variable adyc to day
Renamed variable agec to age
Renamed variable comorb2 to co
Renamed variable arm to tx
Renamed variable or15scor to y15
Renamed variable ordscor to y
Renamed variable bcsosn to y0
Renamed variable bdursymp to dursx
Modified variable day
Modified variable age
Modified variable co
New object size: 1527040 bytes; 15 variables 17210 observations
setDT(d, key=c('id', 'day'))
d[, table(actarm, tx)]
tx
actarm Placebo Remdesivir
Not Treated 4 10
Placebo 8130 0
Remdesivir 29 8278
Unplanned Treatment 730 29
d[day == 0, sum(is.na(y0 + y))]
[1] 0
d[day == 0, table(y0, y, useNA='ifany')] # BCSOSN identical to ADYC=Baseline ORDSCOR
y
y0 4 5 6 7
4 138 0 0 0
5 0 435 0 0
6 0 0 193 0
7 0 0 0 285
Look at data for those 11 patients who will later be deleted.
d[is.na(day), ]
id tx actarm y0 d29dthe0 d29dthe1 ttrecov0 ttrecov1 day
1: COV.00773 Remdesivir Not Treated NA NA 0 NA 0 NA
2: COV.00855 Remdesivir Not Treated NA NA 0 NA 0 NA
3: COV.01173 Placebo Not Treated NA NA 0 NA 0 NA
4: COV.01263 Remdesivir Not Treated NA NA 0 NA 0 NA
5: COV.01311 Remdesivir Not Treated NA NA 0 NA 0 NA
6: COV.01435 Placebo Not Treated NA NA 0 NA 0 NA
7: COV.01619 Remdesivir Not Treated NA NA 1 NA 1 NA
8: COV.01727 Placebo Not Treated NA NA 0 NA 0 NA
9: COV.01804 Remdesivir Not Treated NA NA 1 NA 0 NA
10: COV.01806 Remdesivir Not Treated NA NA 0 NA 0 NA
11: COV.01807 Remdesivir Not Treated NA NA 1 NA 1 NA
y y15 age sex dursx co
1: NA 7 30 M 16 NA
2: NA 7 41 F 6 NA
3: NA 7 65 F 8 1
4: NA 7 54 F 8 NA
5: NA 7 49 F 12 NA
6: NA 7 61 M 14 NA
7: NA 7 74 F 14 NA
8: NA 7 71 M 7 NA
9: NA 7 65 M 9 NA
10: NA 7 51 M 1 NA
11: NA 7 76 M 12 NA
Check agreement between baseline and adyc=1
status and between adyc=1
status and adyc=2
status.
w <- d[day %in% 0:2, .(id, day, y)]
w <- dcast(w, id ~ paste0('y', day), value.var='y')
w[, table(y0, y1, useNA='ifany')]
y1
y0 3 4 5 6 7 <NA>
4 1 104 20 3 1 9
5 0 8 331 33 22 41
6 0 3 13 145 16 16
7 0 0 3 3 245 34
w[, table(y1, y2, useNA='ifany')]
y2
y1 1 3 4 5 6 7 8 <NA>
3 0 1 0 0 0 0 0 0
4 0 1 91 14 2 0 0 7
5 2 1 26 308 16 6 2 6
6 0 0 1 18 138 24 2 1
7 0 0 0 1 5 273 4 1
<NA> 0 0 10 35 13 36 1 5
Check missingness patterns for early days. There were 13 patients with no record for (originally numbered) days 2-5. These seem to be patients going home early with large gaps, as shown in the listing below. Note that days 1-5 refer to patient status on day 0-4.
days <- 1 : 5
w <- d[day %in% days, .(id, day, y)]
w <- dcast(w, id ~ day, value.var='y')
z <- rep('', nrow(w))
for(i in days) z <- paste0(z, 1 * is.na(w[[as.character(i)]]))
table(z)
z
00000 00001 00010 00011 00110 00111 01000 01010 01111 10000 10001 10010 10011
760 74 6 46 8 42 1 1 13 88 2 1 4
# combplotp(~ is.na(y1) + is.na(y2) + ..., data=w)
idm <- w$id[z == '01111']
k <- d[id %in% idm, .(id, day, y, ttrecov0, ttrecov1)]
k
id day y ttrecov0 ttrecov1
1: COV.00735 0 5 1 NA
2: COV.00735 1 5 1 NA
3: COV.00735 15 1 1 NA
4: COV.00735 22 1 1 NA
5: COV.00735 29 1 1 NA
6: COV.00757 0 6 NA 0
7: COV.00757 1 6 NA 0
8: COV.00788 0 6 1 NA
9: COV.00788 1 4 1 NA
10: COV.00788 15 2 1 NA
11: COV.00788 22 1 1 NA
12: COV.00788 28 1 1 NA
13: COV.00952 0 5 1 NA
14: COV.00952 1 5 1 NA
15: COV.00952 14 1 1 NA
16: COV.00952 20 1 1 NA
17: COV.00952 28 1 1 NA
18: COV.01003 0 5 1 NA
19: COV.01003 1 5 1 NA
20: COV.01003 14 1 1 NA
21: COV.01003 21 1 1 NA
22: COV.01003 25 1 1 NA
23: COV.01093 0 4 1 NA
24: COV.01093 1 4 1 NA
25: COV.01093 14 1 1 NA
26: COV.01093 21 1 1 NA
27: COV.01093 27 1 1 NA
28: COV.01123 0 4 1 NA
29: COV.01123 1 4 1 NA
30: COV.01123 15 2 1 NA
31: COV.01123 22 1 1 NA
32: COV.01123 29 1 1 NA
33: COV.01139 0 4 1 NA
34: COV.01139 1 4 1 NA
35: COV.01139 21 1 1 NA
36: COV.01139 33 2 1 NA
37: COV.01209 0 5 1 NA
38: COV.01209 1 5 1 NA
39: COV.01209 17 1 1 NA
40: COV.01209 22 1 1 NA
41: COV.01209 31 1 1 NA
42: COV.01309 0 5 1 NA
43: COV.01309 1 5 1 NA
44: COV.01309 16 1 1 NA
45: COV.01309 19 1 1 NA
46: COV.01309 30 1 1 NA
47: COV.01319 0 4 1 NA
48: COV.01319 1 4 1 NA
49: COV.01319 14 1 1 NA
50: COV.01319 19 1 1 NA
51: COV.01319 30 2 1 NA
52: COV.01572 0 7 NA 0
53: COV.01572 1 7 NA 0
54: COV.01666 0 4 1 NA
55: COV.01666 1 4 1 NA
56: COV.01666 12 1 1 NA
57: COV.01666 22 1 1 NA
58: COV.01666 28 1 1 NA
id day y ttrecov0 ttrecov1
idm <- w$id[substring(z, 1, 1) == '1']
k <- d[id %in% idm, .(id, day, y, ttrecov0, ttrecov1)]
# View(k)
Drop the 11 subjects with missing day.
d <- d[! is.na(day), ]
# For each subject create a sequential record number
d[, seq := 0 : (.N - 1), by=id]
d[day < 2, table(day, seq, useNA='ifany')]
seq
day 0 1
0 1051 0
1 0 951
# Check treatment counts vs Table 2 in paper
d[day == 0, table(tx)] # doesn't match: get 518 (missing 3) 533 (missing 8)
tx
Placebo Remdesivir
518 533
# The 11 missing are the ones described above
Compute day of death (NA
if not died).
# Compute day of death
d[, ddeath := if(any(y == 8)) min(day[y == 8]) else NA_integer_, by=id]
# View(d[, .(id, seq, day, d29dthe0, ddeath, y, y15)])
d[, sum(day > ddeath, na.rm=TRUE)]
[1] 449
# Remove records where death was carried forward
dim(d)
[1] 17199 17
d <- d[day <= ddeath | is.na(ddeath), ]
dim(d)
[1] 16750 17
# Check death count against Table 2 in paper
d[seq == 0, table(tx, ! is.na(ddeath))]
tx FALSE TRUE
Placebo 441 77
Remdesivir 473 60
d[, table(y == 8)] # OK: 137 vs. 136 death by day 29
FALSE TRUE
16613 137
# Check distribution of y on day 15 vs paper
d[seq == 0, table(y15, tx)] # matches except missing 11 with y=7
tx
y15 Placebo Remdesivir
1 115 157
2 102 117
3 8 14
4 33 38
5 60 58
6 24 28
7 118 87
8 58 34
d[day == 15, table(y, tx)] # only 1/2 of patients have day 15 records
tx
y Placebo Remdesivir
1 27 37
2 27 21
3 8 10
4 32 36
5 45 45
6 23 17
7 109 79
8 4 1
# Compare to time to death variable
d[, table(d29dthe0 - ddeath)]
-5 -4 -1
6 23 1676
d[, sum(d29dthe0 - ddeath != -1, na.rm=TRUE)]
[1] 29
w <- d[d29dthe0 - ddeath != -1, .(id, day, y, d29dthe0)] # 3 subjects, 29 records
print(w)
id day y d29dthe0
1: COV.01144 0 6 24
2: COV.01144 1 6 24
3: COV.01144 2 6 24
4: COV.01144 3 7 24
5: COV.01144 4 7 24
6: COV.01144 29 8 24
7: COV.01204 0 4 18
8: COV.01204 1 5 18
9: COV.01204 2 4 18
10: COV.01204 3 4 18
11: COV.01204 4 5 18
12: COV.01204 5 5 18
13: COV.01204 12 2 18
14: COV.01204 22 8 18
15: COV.01522 0 7 18
16: COV.01522 1 7 18
17: COV.01522 2 7 18
18: COV.01522 3 7 18
19: COV.01522 4 7 18
20: COV.01522 5 7 18
21: COV.01522 6 7 18
22: COV.01522 7 7 18
23: COV.01522 8 7 18
24: COV.01522 9 7 18
25: COV.01522 10 7 18
26: COV.01522 11 7 18
27: COV.01522 12 7 18
28: COV.01522 13 7 18
29: COV.01522 22 8 18
id day y d29dthe0
Check that there is only one death record per subject and that when there is a death record it is always the last record. Compute last day in hospital.
# Tabulate number of death records per patient
d[, .(n8=sum(y == 8)), by=id][, table(n8)]
n8
0 1
914 137
d[y == 8, .(fromlast=max(day) - day), by=id][, table(fromlast)]
fromlast
0
137
# Compute last day known to be in the hospital at least part of that day
d[, dlasthosp := max(day[y >= 3 & y < 8]), by=id]
# Some patients un-recovered and had to be readmitted. For those dlasthosp is the last day recorded in hospital.
# Compute dlasthospf which is the last day in the first hospital admission
# To compute this first compute dhome = first day with home status
d[, dhome := min(day[y < 3]), by=id] # NA if no at home records
d[, dlasthospf := max(day[y >= 3 & y < 8 & (is.na(dhome) | day < dhome)]), by=id]
z <- d[is.na(dlasthosp) | is.na(dlasthospf) | dlasthosp != dlasthospf, .(id, day, y, dhome, dlasthosp, dlasthospf)]
# View(z)
# Count number of patients who unrecovered
d[seq == 1, sum(dlasthosp != dlasthospf)]
[1] 28
Count the number of patients by treatment category, and check this against baseline records. Also check that every patient has only one treatment code.
d[, .(n=length(unique(id))), by=tx]
tx n
1: Placebo 518
2: Remdesivir 533
d[day == 0, .N, by=tx]
tx N
1: Placebo 518
2: Remdesivir 533
table(d[, .(n=length(unique(tx))), by=id]$n)
1
1051
Examine gaps in longitudinal measurements. Check that death is an absorbing state.
d[, gap := ifelse(day == 0, 0, day - shift(day)), by=id]
# Get distribution of gaps before and after going home the first time
w <- d[, .(athome = day > dlasthospf, gap)]
w[, table(gap, athome)]
athome
gap FALSE TRUE
0 1051 0
1 13556 289
2 99 58
3 2 53
4 2 109
5 4 143
6 2 245
7 2 506
8 2 229
9 1 110
10 0 109
11 0 52
12 0 54
13 0 32
14 0 20
15 0 7
16 0 2
17 0 5
19 0 1
20 0 1
21 0 1
22 0 1
25 0 1
29 0 1
# ggplot(d, aes(x=gap)) + geom_histogram()
d[, ggplotlyr(ggfreqScatter(day, gap, by=y))]
w <- d[day > 0 & gap == 0, .(id, day, gap)]
w
Empty data.table (0 rows and 3 cols): id,day,gap
d[id %in% w$id, .(id, day)]
Empty data.table (0 rows and 2 cols): id,day
w <- d[y == 8, .(dgap = .N - (seq + 1))]
w[, table(dgap)]
dgap
105 107 109 110 111 112 113 114 116 117 118 119 120 121 122 123 124 125 126 127
1 2 2 3 1 3 3 2 6 4 2 4 4 4 12 6 6 7 3 3
128 129 130 131 132 133 134 135
5 10 13 8 7 6 9 1
Check day 15 status vs. nearby daily status while in hospital. You can see that there are 15 patients with day 15 y=5 and the record on day 15 has y=4.
d[seq == 0, table(day <= dlasthosp)]
TRUE
1051
# Get patients not dying or discharged to home before day 16
w <- d[(is.na(ddeath) | ddeath >= 16) & dlasthosp > 16, ]
w[seq == 0, .N]
[1] 365
w[day %in% 13:17, .N, by=day]
day N
1: 13 351
2: 14 358
3: 15 353
4: 16 350
5: 17 350
for(i in 13 : 17) {
cat('\nDay', i, 'since randomization\n')
print(w[day == i, table(y15, y)])
}
Day 13 since randomization
y
y15 2 3 4 5 6 7
3 0 7 2 0 0 0
4 0 0 34 6 0 0
5 1 1 1 48 13 9
6 0 0 0 3 31 6
7 0 0 0 0 4 185
Day 14 since randomization
y
y15 1 2 3 4 5 6 7
1 3 0 0 0 0 0 0
2 0 3 0 0 0 0 0
3 0 0 9 0 0 0 0
4 0 0 0 42 0 0 0
5 0 0 0 0 68 2 2
6 0 0 0 0 1 39 0
7 0 0 0 0 0 0 189
Day 15 since randomization
y
y15 2 3 4 5 6 7
2 2 0 0 1 0 0
3 0 8 1 0 0 0
4 0 1 40 1 0 0
5 0 1 8 58 3 1
6 0 1 1 9 25 3
7 0 0 0 2 9 178
Day 16 since randomization
y
y15 2 3 4 5 6 7
2 1 0 0 1 0 0
3 0 8 1 0 0 0
4 0 0 36 4 0 0
5 0 2 13 52 3 1
6 0 1 1 14 21 2
7 0 0 0 10 14 165
Day 17 since randomization
y
y15 2 3 4 5 6 7
2 1 0 1 0 0 0
3 0 8 0 1 0 0
4 0 2 36 2 0 0
5 0 3 21 44 2 1
6 0 0 2 16 19 2
7 0 1 0 20 15 153
# View(w[,.(id, day, y, y15, gap)])
Get the frequency distribution of the day 15 status variable for those patients going home before day 15 and also compute the day of the patient’s last record.
w <- d[dlasthospf <= 15, .(y15=y15[1], lastday=max(day, na.rm=TRUE)), by=id]
w[, table(lastday, y15, useNA='ifany')]
y15
lastday 1 2 3 4 5 6 7 8
0 0 0 0 0 4 1 0 0
1 0 0 0 0 0 1 1 0
2 0 1 0 0 0 2 0 9
3 0 1 0 0 1 2 1 7
4 0 1 0 0 1 0 1 6
5 0 0 0 0 0 1 0 6
6 0 1 0 0 0 0 1 14
7 0 0 0 1 1 0 0 9
8 0 0 0 0 1 0 0 4
9 0 0 0 0 0 0 0 4
10 0 1 0 0 0 0 0 4
11 0 0 0 0 0 0 1 7
12 0 0 0 0 0 0 0 4
13 2 0 0 0 0 0 0 6
14 0 0 1 1 0 0 0 11
15 1 0 0 2 0 1 5 1
16 0 1 0 1 1 0 1 0
18 0 1 0 0 0 0 0 0
19 1 0 0 0 0 0 0 0
21 2 1 0 0 0 0 0 0
22 0 1 0 0 0 0 1 0
23 1 0 0 0 0 0 0 0
24 1 0 0 0 0 0 1 0
25 11 9 1 2 0 1 0 0
26 13 16 1 1 2 0 0 0
27 31 28 0 3 8 0 0 0
28 94 68 3 13 10 1 0 0
29 62 37 0 3 5 0 1 0
30 30 27 1 2 1 0 0 0
31 16 15 2 1 2 1 0 0
32 4 6 0 0 0 0 0 0
33 3 4 0 0 0 0 0 0
Check that alive patients almost always have daily ordinal assessments while in hospital (y > 2).
d[day > 0 & day <= dlasthosp & (is.na(ddeath) | day < ddeath), table(gap)]
gap
1 2 3 4 5 6 7 8 9 10 11
13691 102 2 12 8 10 11 7 3 1 3
Check censoring day against last day with an ordinal assessment. Last contact day for vital status censoring purposes is d29dth1 + 1
.
d[, dcens := pmax(d29dthe0, d29dthe1, ttrecov0 + 1, ttrecov1 + 1, na.rm=TRUE)]
# View(d[, .(id, day, y, ddeath, dcensd)])
z <- d[day <= 28, .(maxday = max(day), dcens=dcens[1]), by=.(id, nad=is.na(d29dthe0))]
z[, table(nad, maxday - dcens)]
nad -27 -26 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10
FALSE 9 7 7 6 14 9 4 4 4 7 4 6 10 5 3 5 2 4
TRUE 0 0 0 1 0 0 0 0 0 0 0 0 0 2 4 3 3 7
nad -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 4
FALSE 5 0 2 4 3 0 4 2 2 0 3 1
TRUE 10 9 72 85 62 28 34 15 54 524 2 0
# 4 pts with maxday > day of death
# Take daily assessments to current max day or dcens whichever is greater
For days between last ordinal assessment in the first hospital admission and death (if any), we know the patient is alive and at home. Add daily records to indicate this, with y=1.5 to indicate “at home, activity limits and O2 status unknown”. Stop at day 29. Note that when a patient is kept in the hospital for infection control (y=3) they still get daily outcome assessments. If the patient has a missing y at day=1 set y to the baseline state.
# First create a one row per patient data table with information that doesn't change over days
u <- d[seq == 0, .(id, tx, ddeath, dhome, dlasthosp, dlasthospf, d29dthe0, d29dthe1, actarm, y0, ttrecov0, ttrecov1, y15,
age, sex, dursx, co, dcens)]
setkey(u, 'id')
# Create a dataset with everyone having 28d of data
w <- expand.grid(id=unique(d$id), day=1:28)
setDT(w, key=c('id', 'day'))
# Next line is same as w <- merge(w, d[, .(id, day, y)], by=c('id', 'day'), all.x=TRUE)
w <- d[w, .(id, day, y), nomatch=NA]
w <- u[w, nomatch=NA]
w[is.na(y) & day == 1, y := y0]
w[, y := as.numeric(y)]
w[is.na(y) & day > dlasthospf & day <= dcens & (is.na(ddeath) | day < ddeath), y := 1.5]
w <- w[! is.na(y), ]
# Look at 10 randomly chosen patients
ids <- sort(sample(w$id, 10))
if(FALSE) for(i in ids) {
xless(d[i, .(day, y, dhome, ddeath, dlasthosp, dlasthospf, dcens)], title=i)
xless(w[i, .(day, y, dhome, ddeath, dlasthosp, dlasthospf, dcens)], title=i)
readLines(n=1)
system('pkill xless')
}
setkey(w, id, day)
d <- w
Add the previous state to each record, and describe the new analysis file.
d[, yprev := ifelse(day == 1, y0, shift(y)), by=id]
d[, gap := ifelse(day == 1, 1, day - shift(day)), by=id]
# Remove baseline records
d <- d[day > 0, ]
# Add sequential record number
d[, seq := (1 : .N), by=id]
d <- upData(d,
labels=c(yprev = 'Previous Outcome Status',
day = 'Day',
seq = 'Sequential Record Number',
ddeath = 'Day of First Record With Status = Death',
dhome = 'First day with status at home (NA if never)',
dlasthosp = 'Last day in hospital with outcome assessment',
dlasthospf= 'Last day in first hospital admission with outcome assessment',
dcens = 'Day of censoring for TTR/death',
gap = 'Days Since Previous Record',
y0 = 'Baseline Status',
y = 'Outcome Status'))
Input object size: 3326680 bytes; 23 variables 26167 observations
New object size: 3122288 bytes; 23 variables 26167 observations
html(describe(d))
n | missing | distinct |
---|---|---|
26167 | 0 | 1051 |
lowest : | COV.00701 | COV.00702 | COV.00703 | COV.00704 | COV.00705 |
highest: | COV.01802 | COV.01803 | COV.01805 | COV.01808 | COV.01809 |
n | missing | distinct |
---|---|---|
26167 | 0 | 2 |
Value Placebo Remdesivir Frequency 12743 13424 Proportion 0.487 0.513
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1674 | 24493 | 27 | 0.997 | 16.93 | 8.534 | 5 | 6 | 11 | 17 | 23 | 27 | 29 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
19070 | 7097 | 32 | 0.974 | 17.25 | 5.53 | 12 | 13 | 14 | 15 | 21 | 27 | 28 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26167 | 0 | 33 | 0.997 | 14.29 | 10.73 | 2 | 3 | 6 | 12 | 24 | 28 | 29 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26167 | 0 | 33 | 0.997 | 13.86 | 10.56 | 2 | 3 | 5 | 11 | 22 | 28 | 29 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1646 | 24521 | 26 | 0.997 | 15.55 | 8.174 | 4 | 5 | 10 | 15 | 22 | 25 | 26 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
24521 | 1646 | 23 | 0.504 | 27.45 | 0.9683 | 25 | 26 | 28 | 28 | 28 | 28 | 28 |
n | missing | distinct |
---|---|---|
26167 | 0 | 4 |
Value Not Treated Placebo Remdesivir Frequency 4 12017 13421 Proportion 0.000 0.459 0.513 Value Unplanned Treatment Frequency 725 Proportion 0.028
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
26167 | 0 | 4 | 0.898 | 5.56 | 1.12 |
Value 4 5 6 7 Frequency 3651 11094 4531 6891 Proportion 0.140 0.424 0.173 0.263
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
20486 | 5681 | 33 | 0.997 | 10.16 | 7.918 | 2 | 3 | 4 | 8 | 14 | 21 | 25 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5681 | 20486 | 18 | 0.859 | 27.1 | 1.928 | 25 | 27 | 27 | 27 | 28 | 28 | 30 |
Value 0 1 2 3 4 5 6 7 10 13 14 15 Frequency 7 6 12 10 6 6 15 8 12 14 29 17 Proportion 0.001 0.001 0.002 0.002 0.001 0.001 0.003 0.001 0.002 0.002 0.005 0.003 Value 24 25 26 27 28 30 Frequency 50 156 135 2464 2203 531 Proportion 0.009 0.027 0.024 0.434 0.388 0.093
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
26167 | 0 | 8 | 0.955 | 3.597 | 2.693 |
Value 1 2 3 4 5 6 7 8 Frequency 7479 5934 586 1862 3033 1231 5322 720 Proportion 0.286 0.227 0.022 0.071 0.116 0.047 0.203 0.028
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26167 | 0 | 68 | 1 | 57.98 | 16.62 | 32 | 38 | 48 | 59 | 68 | 77 | 82 |
n | missing | distinct |
---|---|---|
26167 | 0 | 2 |
Value F M Frequency 9309 16858 Proportion 0.356 0.644
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26083 | 84 | 35 | 0.995 | 9.865 | 5.669 | 3 | 4 | 6 | 9 | 12 | 17 | 20 |
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
26082 | 85 | 3 | 0.816 | 1.348 | 0.8082 |
Value 0 1 2 Frequency 5035 6931 14116 Proportion 0.193 0.266 0.541
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26167 | 0 | 29 | 0.68 | 27.65 | 1.254 | 25 | 26 | 28 | 28 | 28 | 29 | 29 |
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26167 | 0 | 28 | 0.999 | 13.88 | 9.243 | 2 | 3 | 7 | 14 | 21 | 25 | 27 |
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
26167 | 0 | 9 | 0.916 | 3.773 | 2.545 |
Value 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0 8.0 Frequency 930 10765 536 348 2107 3785 1813 5750 133 Proportion 0.036 0.411 0.020 0.013 0.081 0.145 0.069 0.220 0.005
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
26167 | 0 | 8 | 0.92 | 3.875 | 2.529 |
Value 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0 Frequency 664 10439 415 336 2207 4170 1987 5949 Proportion 0.025 0.399 0.016 0.013 0.084 0.159 0.076 0.227
n | missing | distinct | Info | Mean | Gmd |
---|---|---|---|---|---|
26167 | 0 | 8 | 0.002 | 1.002 | 0.003591 |
Value 1 2 3 4 5 6 7 8 Frequency 26153 3 2 2 4 1 1 1 Proportion 0.999 0.000 0.000 0.000 0.000 0.000 0.000 0.000
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26167 | 0 | 28 | 0.999 | 13.86 | 9.231 | 2 | 3 | 7 | 14 | 21 | 25 | 27 |
# Visually check key columns
# View(d[, .(id, seq, day, y0, yprev, y)])
saveRDS(d, 'actt1.rds', compress='xz')
Compute all consecutive day transition proportions to see if for some days the ordinal scale is always the same as the previous day. That would indicate carry forward instead of real daily assessment.
ordlabels <- c('not hosp, no limits', 'not hosp', 'not hosp, limits or O2',
'hosp, no O2, no care', 'hosp, no O2, care',
'hosp, O2', 'hosp, noninvasive vent', 'hosp, invasive vent', 'death')
require(plotly)
g <- propsTrans(y ~ day + id, data=d, maxsize=2, labels=ordlabels) +
theme(legend.position='bottom')
ggplotlyr(g) # ggplotlyr in Hmisc to fix tooltip hovertext
To read the R file do:
require(Hmisc)
require(data.table)
d <- readRDS('actt1.rds')
contents(d) # show all the metadata
The records stop at day 28 of follow-up.
The status variables y, yprev, y0, y15
are coded 1, 1.5, 2, 3, …, 8. 1.5 stands for the patient being at home but it is not known whether y=1 or 2 because a daily outcome assessment was not made on that dasy. To turn these variables into factors with value labels do the following:
ordlabels <- c('not hosp, no limits', 'not hosp', 'not hosp, limits or O2',
'hosp, no O2, no care', 'hosp, no O2, care',
'hosp, O2', 'hosp, noninvasive vent', 'hosp, invasive vent', 'death')
g <- function(x) factor(x, c(1, 1.5, 2:8), ordlabels)[,drop=TRUE]
d[, y := g(y)]
d[, y15 := g(y15)]
d[, y0 := g(y0)]
d[, yprev := g(yprev)]
# To do all at once:
d[, c('y', 'y15', 'y0', 'yprev') := list(g(y), g(y15), g(y0), g(yprev))]
The drop=TRUE
will force y0
or yprev
to only take on the 4 observed levels. Note that with data.table
, :=
means “define a new variable and store it in-place in the existing data table”.
Note that y15
is the NIAID-provided day 15 status variable, which doesn’t always match up with the outcome status in the day 15 record.
This dataset excludes 11 patients for whom baseline severity wasn’t collected and who immediately dropped out of the study.
Since this data table does not carry death forward, you need to do the following to empirically look at state occupancy proportions which require death to be carried forward. Here we carry death forward to day 28.
nrow(d) # rows before carrying forward
[1] 26167
w <- d[, .(dlast=day, last8 = day == max(day) & y == 8, tx=tx), by=id]
w <- w[last8 == TRUE, .(day = (dlast + 1) : 28,
y = rep(8, 28 - dlast), tx=tx),
by=id]
label(w$day) <- label(d$day)
label(w$y) <- label(d$y)
w <- rbind(d[, .(id, day, y, tx)], w)
setkey(w, id, day)
nrow(w) # rows after carrying forward
[1] 28326
The following also works and is a bit more elegant. For more data.table
coding options see this.
u <- d[, .SD[(day == max(day) & y == 8),], by=id,
.SDcols=c('day', 'tx', 'y')]
u <- u[, .(day = (day + 1) : 28, y=rep(8, 28 - day), tx=tx), by=id]
label(u$day) <- label(d$day)
label(u$y) <- label(d$y)
u <- rbind(d[, .(id, day, y, tx)], u)
setkey(u, id, day)
identical(w, u)
[1] TRUE
As an alternative, the following fills in missing days before death with y=1.5 (home, unspecified) and carries death forward to day 28. This is useful for descriptive state occupancy statistics. For patients with missing day 1 records assume y is the baseline state. Don’t fill-in missing y for non-deaths with 1.5 at day last known to be in hospital (dlasthosp
). But don’t use this method since the 1.5s have already been filled in in the master analysis file.
# First create a one row per patient data table with information that doesn't change over days
u <- d[seq == 1, .(id, tx, y0, ddeath, dlasthosp, ttrecov0, ttrecov1)]
setkey(u, 'id')
# Create a dataset with everyone having 28d of data
w <- expand.grid(id=unique(d$id), day=1:28)
setDT(w, key=c('id', 'day'))
# Next line is same as w <- merge(w, d[, .(id, day, y)], by=c('id', 'day'), all.x=TRUE)
w <- d[w, .(id, day, y), nomatch=NA]
w <- u[w, nomatch=NA]
w[, y := ifelse(! is.na(y), y,
ifelse(day == 1, y0,
ifelse(is.na(ddeath), 1.5, 8 * (day >= ddeath))))]
ids <- c('COV.00701', 'COV.01793', 'COV.01006', 'COV.01144')
d[id %in% ids, .(id, day, y, tx)]
# xless(w[id %in% ids, ], topn=400)
label(w$day) <- label(d$day)
label(w$y) <- label(d$y)
dcf <- w
Compute for each patient the number of days until y < 4. Death counts as non-recovery at day 28 per ACTT-1.
u <- d[, .(ttr = if(any(y == 8) || ! any(y < 4)) 28L else min(day[y < 4]),
ttro= ifelse(is.na(ttrecov0), ttrecov1, ttrecov0)[1],
ttrecov0=ttrecov0[1], ttrecov1=ttrecov1[1],
y0=y0[1]),
by=.(id, tx)]
u[, .(Mean=mean(ttr), Median=median(ttr)), by=tx]
tx Mean Median
1: Placebo 17.27992 17
2: Remdesivir 14.80488 12
u[, .(Mean=mean(ttr), Median=median(ttr)), keyby=.(y0, tx)]
y0 tx Mean Median
1: 4 Placebo 10.12698 7
2: 4 Remdesivir 8.16000 6
3: 5 Placebo 14.06897 10
4: 5 Remdesivir 11.05172 8
5: 6 Placebo 18.67347 20
6: 6 Remdesivir 17.72632 17
7: 7 Placebo 23.55195 28
8: 7 Remdesivir 23.13740 28
ggplot(u, aes(x=ttr)) + geom_histogram() + facet_grid(y0 ~ tx) +
xlab('Time to Recovery Using Daily States') + ylab('')
u[, .(Mean=mean(ttro), Median=median(ttro)), by=tx]
tx Mean Median
1: Placebo 15.91699 14.5
2: Remdesivir 13.28143 9.0
u[, .(Mean=mean(ttro), Median=median(ttro)), keyby=.(y0, tx)]
y0 tx Mean Median
1: 4 Placebo 8.523810 6.0
2: 4 Remdesivir 6.680000 5.0
3: 5 Placebo 12.379310 9.0
4: 5 Remdesivir 9.469828 7.0
5: 6 Placebo 18.142857 19.5
6: 6 Remdesivir 15.863158 12.0
7: 7 Placebo 22.188312 26.0
8: 7 Remdesivir 21.938931 26.0
u[, ggfreqScatter(ttrecov0, ttr)]
u[, ggfreqScatter(ttrecov1, ttr)]
u[, ggfreqScatter(ttro, ttr)]
i <- u[, .(id=id, susp=! is.na(ttrecov0) & abs(ttr - ttrecov0) > 2)]
# i <- u[, .(id=id, susp=abs(ttr - ttro) > 2)]
table(i$susp)
FALSE TRUE
1030 21
# Get raw data for those with large differences
ttri <- u$ttr
names(ttri) <- u$id
idsusp <- i[susp == TRUE, id]
z <- d[id %in% idsusp, .(id, day, y, ttr=ttri[id], ttrecov0, ttrecov1, ddeath, dlasthosp)]
# View(z)
R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Pop!_OS 20.10 Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] plotly_4.9.3 data.table_1.13.0 haven_2.3.1 Hmisc_4.6-0 [5] ggplot2_3.3.2 Formula_1.2-3 survival_3.2-11 lattice_0.20-44To cite R in publications use:
R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:Harrell Jr F (2021). Hmisc: Harrell Miscellaneous. R package version 4.6-0, https://hbiostat.org/R/Hmisc/.