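library(Hmisc)     # Load, upData, Cs, label, summarize
library(hreport)   # accrualReport, dReport, eReport, survReport
Load(ssafety)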
ssafety <- upData(ssafety, rdate=as.Date(rdate),
                  smoking=factor(smoking, 0:1, c('No','Yes')),
                  labels=c(smoking='Smoking', bmi='BMI',
                    pack.yrs='Pack Years', age='Age',
                    height='Height', weight='Weight'),
                  units=c(age='years', height='cm', weight='Kg'),
                  print=FALSE)
mtime <- function(f) format(file.info(f)$mtime)
datadate        <- mtime('ssafety.rda')
primarydatadate <- mtime('ssafety.rda')

## List of lab variables that are missing too much to be used
omit  <- Cs(amylase,aty.lymph,glucose.fasting,neutrophil.bands)

## Make a list that separates variables into major categories
vars <- list(baseline=Cs(age, sex, race, height, weight, bmi,
               smoking, pack.yrs),
             ae  =Cs(headache, ab.pain, nausea, dyspepsia, diarrhea,
                     upper.resp.infect, coad),
             ekg =setdiff(names(ssafety)[c(49:53,55:56)],
               'atrial.rate'),
             chem=setdiff(names(ssafety)[16:48],
               c(omit, Cs(lymphocytes.abs, atrial.rate,
                          monocytes.abs, neutrophils.seg,
                          eosinophils.abs, basophils.abs)))) 
week  <- ssafety$week
weeks <- sort(unique(week))
base  <- subset(ssafety, week==0)
denom <- c(c(enrolled=500, randomized=nrow(base)), table(base$trx))

sethreportOption(tx.var='trx', denom=denom)
## Initialize app.tex

Philosophy

The reporting tools used here are based on a number of lessons learned from the intersection of the fields of statistical graphics, graphic design, and cognitive psychology, especially from the work of Bill Cleveland, Ralph McGill, John Tukey, Edward Tufte, and Jacques Bertin.

Whenever largely numerical information is displayed, graphs convey the information most often needed much better than tables.
1. Tables usually show more precision than is warranted by the sample information while hiding important features.
2. Graphics are much better than tables for seeing patterns and anomalies.
The best graphics are ones that make use of features that humans are most accurate in perceiving, namely position along a common scale.
Information across multiple data categories is usually easier to judge when the categories are sorted by the numeric quantity underlying the information¹.
The most robust and informative descriptive statistics for continuous variables are quantiles and whole distribution summaries².
For group comparisons, confidence intervals for individual means, medians, or proportions are not very useful, and whether or not two confidence intervals overlap is not the correct statistical approach for judging the significance of the difference between the two. The half-width of the confidence interval for the difference, when centered at the midpoint of the two estimates, provides a succinct precision display, and this half-interval touches the two estimates if and only if there is no significant difference between the two.
Each graphic needs a marker that provides the reader with a sense of exactly what fraction of the sample is being analyzed in that graphic.
Tables are best used as backups to graphics.
Tables should emphasize estimates that are not functions of the sample size. For categorical variables, proportions have interpretations independent of sample size so they are the featured estimates, and numerators and denominators are subordinate to the proportions. For continuous variables, minimum and maximum, while useful for data quality checking, are not population parameters, and they expand as n↑, so they are not proper summary statistics.
With the availability of graphics that offer hover text, it is more effective to produce tabular information on demand. The software used here will pop up tabular information related to the point or group currently pointed to by the mouse. This makes it less necessary to produce separate tables.
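As a rough illustration of the half-width display described above for group comparisons, the following sketch uses two hypothetical proportions and a Wald normal approximation. The counts are invented for illustration, and the Wald standard error is an assumption of this sketch, not necessarily the method used by the reporting software.

```r
# Hypothetical counts, invented for illustration only
x1 <- 30; n1 <- 81     # events and subjects in group A
x2 <- 45; n2 <- 169    # events and subjects in group B

p1 <- x1 / n1
p2 <- x2 / n2

# Wald standard error of the difference in proportions (an assumption)
se   <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
half <- qnorm(0.975) * se          # half-width of the 0.95 CI for p1 - p2

# Interval of total length `half`, centered at the midpoint of the estimates
mid  <- (p1 + p2) / 2
band <- c(lower = mid - half / 2, upper = mid + half / 2)

# The band reaches both estimates exactly when |p1 - p2| <= half,
# i.e., when the 0.95 CI for the difference includes zero
touches     <- abs(p1 - p2) <= half
ci_has_zero <- (p1 - p2 - half) <= 0 && (p1 - p2 + half) >= 0

print(round(c(p1 = p1, p2 = p2, half = half, band), 3))
print(c(touches = touches, ci_has_zero = ci_has_zero))
```

The two logical values always agree, which is the "if and only if" property stated in the text.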

Notation

Figure Captions

Needles represent the fraction of observations used in the current analysis. The first needle (red) shows the fraction of enrolled patients used. If randomization was taken into account, a second needle (green) represents the fraction of randomized subjects included in the analysis. When the analyses consider treatment assignment, two more needles may be added to the display, showing, respectively, the fraction of subjects randomized to treatment A used in the analysis and the fraction of subjects on treatment B who were analyzed. The colors of these last two needles are the colors used for the two treatments throughout the report. The following table shows some examples.

# Store using short variable names so Rmarkdown table column
# width will not be wider than actually needed
d1 <- dNeedle(1)
d2 <- dNeedle((3:4)/4)
d3 <- dNeedle((1:2)/4)
d4 <- dNeedle(c(1,2,3,1)/4)

Signpost	Interpretation
	All enrolled subjects analyzed, randomization not considered
	Analysis uses ³⁄₄ of enrolled subjects, and all randomized subjects
	Analysis uses ¹⁄₄ of enrolled subjects, and ¹⁄₂ of randomized subjects
	Same as previous example, and in addition the analysis utilized treatment assignment, analyzing ³⁄₄ of those randomized to A and ¹⁄₄ of those randomized to B
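The fraction behind each needle can be sketched as a simple ratio of subjects used to the corresponding denominator. The denominators below are the ones reported in this document (500 enrolled, 250 randomized, 81 on A, 169 on B); the `used` vector reflects an analysis of all randomized subjects and is written out by hand here, not extracted from the software.

```r
# Denominators as defined for this report (enrolled, randomized, per arm)
denom <- c(enrolled = 500, randomized = 250, A = 81, B = 169)

# Subjects used in an analysis that includes every randomized subject
used <- c(enrolled = 250, randomized = 250, A = 81, B = 169)

# Each needle shows the fraction of its denominator that was analyzed
needle <- used / denom
print(needle)
```

In this case the first needle would sit at one half and the remaining three at one.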

Dot Charts

Dot charts are used to present stratified proportions. Details, including all numerators and denominators of proportions, can be revealed by hovering the mouse over a point.
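A minimal base R sketch of the quantities behind such a chart follows. The data are simulated for illustration, and the variable names only mimic those used elsewhere in this report; the dot chart itself is not drawn.

```r
set.seed(2)
# Simulated data, for illustration only
d <- data.frame(
  trx     = sample(c('A', 'B'), 200, replace = TRUE),
  region  = sample(c('north', 'south'), 200, replace = TRUE),
  smoking = rbinom(200, 1, 0.3)
)

# Numerator, denominator, and proportion of smokers in each stratum
num  <- with(d, tapply(smoking, list(region, trx), sum))
den  <- with(d, tapply(smoking, list(region, trx), length))
prop <- num / den

print(num)
print(den)
print(round(prop, 2))
```

The numerators and denominators are the details that the hover text would reveal for each plotted proportion.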

Survival Curves

Graphs containing pairs of Kaplan-Meier survival curves show a shaded region centered at the midpoint of the two survival estimates and having a height equal to the half-width of the approximate 0.95 pointwise confidence interval for the difference of the two survival probabilities. Time points at which the two survival estimates do not touch the shaded region denote approximately significantly different survival estimates, without any multiplicity correction. Hover the mouse to see numbers of subjects at risk at a specific follow-up time, and more information.
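The shaded region can be approximated by hand as in the sketch below, which uses the survival package on simulated data. It assumes the two arms are independent and uses a normal approximation on the survival scale; the hreport software may compute its band differently.

```r
library(survival)

set.seed(1)
n <- 400
# Simulated data, in the style of the simulations used in this report
dat <- data.frame(t1    = runif(n, 2, 5),
                  e1    = rbinom(n, 1, 0.5),
                  treat = sample(c('a', 'b'), n, TRUE))

fit <- survfit(Surv(t1, e1) ~ treat, data = dat)
s   <- summary(fit, times = c(3, 4))   # estimates at two follow-up times

a <- s$strata == 'treat=a'
b <- s$strata == 'treat=b'

# Standard error of the difference, treating the two arms as independent
se_diff <- sqrt(s$std.err[a]^2 + s$std.err[b]^2)
half    <- qnorm(0.975) * se_diff      # half-width of 0.95 CI for S_a - S_b
mid     <- (s$surv[a] + s$surv[b]) / 2

band <- data.frame(time  = s$time[a],
                   S_a   = s$surv[a],
                   S_b   = s$surv[b],
                   lower = mid - half / 2,
                   upper = mid + half / 2)
# Estimates lying outside the band differ at roughly the 0.05 level (pointwise)
band$differ <- abs(band$S_a - band$S_b) > half
print(band, digits = 3)
```

Each row gives the band at one follow-up time; no multiplicity adjustment is made across times.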

Introduction

This is a sample of the part of a closed meeting Data Monitoring Committee report that contains software-generated results. Components related to efficacy, study design, data monitoring plan,³ summary of previous closed report, interpretation, protocol changes, screening, eligibility, and waiting time until treatment commencement are not included in this example⁴. This report used a random sample of safety data from a randomized clinical trial. Randomization date, dropouts, and compliance variables were simulated, the latter two not being made consistent with the presence or absence of actual data in the random sample. The date and time that the analysis file used here was last updated was 2013-10-27 10:50:46. Source analysis files were last updated on 2013-10-27 10:50:46.

Accrual

accrualReport(randomize(rdate) ~ site(site), data=base,
              dateRange=c('1990-01-01','1994-12-31'),
              targetDate='1994-12-31', targetN=300,
              closeDate=max(base$rdate))

Study Numbers
Number	Category
20	Sites
250	Participants randomized
12.5	Participants per site
20	Sites randomizing
12.5	Subjects randomized per randomizing site
59.4	Months from first subject randomized (1990-01-03) to 1994-12-15
1101.7	Site-months for sites randomizing
55.1	Average months since a site first randomized
0.23	Participants randomized per site per month
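The derived rows of this table follow from simple arithmetic on the counts, as the sketch below shows. The conversion of days to months uses an average month of 365.25 / 12 days, which is an assumption of this sketch and may differ from what accrualReport does internally.

```r
n_rand      <- 250      # participants randomized
n_sites     <- 20       # sites randomizing
site_months <- 1101.7   # site-months for sites randomizing

per_site       <- n_rand / n_sites        # 12.5
avg_months     <- site_months / n_sites   # about 55.1
per_site_month <- n_rand / site_months    # about 0.23

# Months between first randomization and the last date shown in the table,
# assuming an average month of 365.25 / 12 days (an assumption of this sketch)
days   <- as.numeric(as.Date('1994-12-15') - as.Date('1990-01-03'))
months <- days / (365.25 / 12)            # about 59.4

print(c(per_site = per_site, avg_months = avg_months,
        per_site_month = per_site_month, months = months))
```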

∟ Participants randomized over time

The blue line depicts the cumulative frequency. The thick grayscale line represents targets.

Category	N	Used
Enrolled	500	250
Randomized	250	250

∟ Number of sites × number of participants randomized

Number of sites having the given number of participants randomized

Category	N	Used
Enrolled	500	250
Randomized	250	250

∟ Participants randomized by site

Baseline Variables

# Simulate regions
set.seed(1)
base$region <- sample(c('north', 'south'), nrow(base), replace=TRUE)
dReport(sex + race + smoking ~ region + trx, groups='trx', data=addMarginal(base, region))

∟ Proportions for sex, race, and smoking stratified by region and treatment

Proportions for sex, race, and smoking stratified by region and treatment. N=250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
Sex	81	169
Race	81	169
Smoking	81	169

## Show spike histogram and quantiles for raw data
dReport(age + height + weight + bmi + pack.yrs ~ trx, data=base,
        popts=list(ncols=2))

∟ Histograms for age, height, weight, BMI, and pack years stratified by treatment

Histograms for age, height, weight, BMI, and pack years stratified by treatment. N=250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
Age	81	169
Height	81	169
Weight	81	169
BMI	81	169
Pack Years	81	169

Longitudinal Adverse Events

dReport(headache + ab.pain + nausea + dyspepsia + diarrhea +
        upper.resp.infect + coad ~ week + trx + id(id),
        groups='trx', data=ssafety, what='byx',
        popts=list(ncols=2, height=700, width=1100))

∟ Means and 0.95 bootstrap percentile confidence limits for 7 variables vs. week stratified by treatment

Means and 0.95 bootstrap percentile confidence limits for 7 variables vs. week stratified by treatment. N=250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
headache	81	169
abdominal pain	81	169
nausea	81	169
dyspepsia	81	169
diarrhea	81	169
upper resp tract infection	81	169
chronic obstructive airways disease	81	169

Incidence of Adverse Events at Any Follow-up

## Reformat to one record per event per subject per time
aev <- vars$ae
ev  <- ssafety[ssafety$week > 0, c(aev, 'trx', 'id', 'week')]
## Reshape to tall and thin format
evt <- reshape(ev, direction='long', idvar=c('id', 'week'),
               varying=aev, v.names='sev', timevar='event',
               times=aev)
## For each event, id and trx see if event occurred at any week
ne <- with(evt, summarize(sev, llist(id, trx, event),
                          function(y) any(y > 0, na.rm=TRUE)))
## Remove non-occurrences of events
ne <- subset(ne, sev, select=c(id, trx, event))
## Replace event names with event labels
elab <- sapply(ssafety[aev], label)
ne$event <- elab[ne$event]
label(ne$trx) <- 'Treatment'

eReport(event ~ trx, data=ne)

∟ Proportion of adverse events by Treatment

Proportion of adverse events by Treatment sorted by descending risk difference

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Longitudinal EKG Data

dReport(axis + corr.qt + pr + qrs + uncorr.qt + hr ~ week + trx +
        id(id),
        groups='trx', data=ssafety, what='byx',
        popts=list(ncols=2, height=1300, width=1100))

∟ Medians with histograms for axis, corrected qt, pr, qrs, uncorrected qt, and ventricular rate vs. week stratified by treatment

Medians with histograms for axis, corrected qt, pr, qrs, uncorrected qt, and ventricular rate vs. week stratified by treatment. N=248 to 250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
axis	81	169
corrected qt	81	169
pr	81	167
qrs	81	169
uncorrected qt	81	169
ventricular rate	81	169

Longitudinal Clinical Chemistry Data

## Plot 6 variables per figure
cvar <- split(vars$chem, rep(letters[1:4], each=6))
form <- list()
for(sub in names(cvar)) {
  f <- paste(cvar[[sub]], collapse=' + ')
  form[[sub]] <- as.formula(paste(f, 'week + trx + id(id)', sep=' ~ '))
}
do <- function(form)
  dReport(form, groups='trx', data=ssafety,
          what='byx', 
          popts=list(ncols=2, height=1300, width=1100,
                     dhistboxp.opts=list(nmin=10, ff1=1.35)))
# Minimum of 10 observations per x per group for histogram and quantiles
# to be drawn (default is nmin=5)
do(form$a)

∟ Medians with histograms for neutrophils absolute, alanine aminotransferase, albumin, alkaline phosphatase, aspartate aminotransferase, and basophils vs. week stratified by treatment

Medians with histograms for neutrophils absolute, alanine aminotransferase, albumin, alkaline phosphatase, aspartate aminotransferase, and basophils vs. week stratified by treatment. N=72 to 250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
neutrophils absolute	81	169
alanine aminotransferase	81	169
albumin	81	169
alkaline phosphatase	81	169
aspartate aminotransferase	81	169
basophils	21	51

do(form$b)

∟ Medians with histograms for total bilirubin, blood urea nitrogen, chloride, creatinine, eosinophils, and γ glutamyl transferase vs. week stratified by treatment

Medians with histograms for total bilirubin, blood urea nitrogen, chloride, creatinine, eosinophils, and γ glutamyl transferase vs. week stratified by treatment. N=72 to 250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
total bilirubin	81	169
blood urea nitrogen	81	169
chloride	81	169
creatinine	81	169
eosinophils	21	51
gamma glutamyl transferase	81	169

do(form$c)

∟ Medians with histograms for glucose - random, hematocrit, hemoglobin, potassium, lymphocytes, and monocytes vs. week stratified by treatment

Medians with histograms for glucose - random, hematocrit, hemoglobin, potassium, lymphocytes, and monocytes vs. week stratified by treatment. N=72 to 250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
glucose - random	81	163
hematocrit	81	169
hemoglobin	81	169
potassium	81	169
lymphocytes	21	51
monocytes	21	51

do(form$d)

∟ Medians with histograms for sodium, platelets, total protein, red blood cell count, uric acid, and white blood cell count vs. week stratified by treatment

Medians with histograms for sodium, platelets, total protein, red blood cell count, uric acid, and white blood cell count vs. week stratified by treatment. N=250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
sodium	81	169
platelets	81	169
total protein	81	169
red blood cell count	81	169
uric acid	81	169
white blood cell count	81	169

# dReport(wbc ~ week + trx + id(id), groups='trx', data=ssafety,
#         what='byx', popts=list(dhistboxp.opts=list(ff1=1.2)))

## Repeat last figure using quantile intervals instead of spike histograms
dReport(form$d, groups='trx', data=ssafety,
        what='byx', byx.type='quantiles',
        popts=list(ncols=2, height=1300, width=1100))

∟ Medians with quantile intervals for sodium, platelets, total protein, red blood cell count, uric acid, and white blood cell count vs. week stratified by treatment

Medians with quantile intervals for sodium, platelets, total protein, red blood cell count, uric acid, and white blood cell count vs. week stratified by treatment. N=250

Category	N	Used
Enrolled	500	250
Randomized	250	250
A	81	81
B	169	169

Variable	A	B
sodium	81	169
platelets	81	169
total protein	81	169
red blood cell count	81	169
uric acid	81	169
white blood cell count	81	169

Time to Hospitalization and Surgery

set.seed(1)
n <- 400
dat <- data.frame(t1=runif(n, 2, 5), t2=runif(n, 2, 5),
                  e1=rbinom(n, 1, .5), e2=rbinom(n, 1, .5),
                  cr1=factor(sample(c('cancer','heart','censor'), n, TRUE),
                             c('censor', 'cancer', 'heart')),
                  cr2=factor(sample(c('gastric','diabetic','trauma', 'censor'),
                                    n, TRUE),
                             c('censor', 'diabetic', 'gastric', 'trauma')),
                  treat=sample(c('a','b'), n, TRUE))
dat <- upData(dat,
              labels=c(t1='Time to operation',
                       t2='Time to rehospitalization',
                       e1='Operation', e2='Hospitalization',
                       treat='Treatment'),
              units=c(t1='Year', t2='Year'), print=FALSE)
denom <- c(enrolled=n + 40, randomized=400, a=sum(dat$treat=='a'),
           b=sum(dat$treat=='b'))
if(FALSE) {
sethreportOption(denom=denom, tx.var='treat')
survReport(Surv(t1, e1) + Surv(t2, e2) ~ treat, data=dat, what='S')
# Show estimates combining treatments
survReport(Surv(t1, e1) + Surv(t2, e2) ~ 1, data=dat,
           what='S', times=3, ylim=c(.1, 1))

# Same but use multiple figures and use 1 - S(t) scale
survReport(Surv(t1, e1) + Surv(t2, e2) ~ treat, data=dat,
           multi=TRUE, what='1-S',
           times=3:4, aehaz=FALSE)

survReport(Surv(t1, e1) + Surv(t2, e2) ~ 1, data=dat,
           multi=TRUE, what='1-S', y.n.risk=-.02)
}

Computing Environment

These analyses were done using the following versions of R⁵, the operating system, and add-on packages hreport, Hmisc⁶, rms⁷, and others:

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hreport_0.5-0     data.table_1.13.0 Hmisc_4.4-2       ggplot2_3.3.2    
[5] Formula_1.2-3     survival_3.2-7    lattice_0.20-41  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          mvtnorm_1.1-1       tidyr_1.1.2        
 [4] zoo_1.8-8           png_0.1-7           digest_0.6.25      
 [7] R6_2.4.1            backports_1.1.10    MatrixModels_0.4-1 
[10] evaluate_0.14       httr_1.4.2          pillar_1.4.6       
[13] rlang_0.4.7         multcomp_1.4-14     lazyeval_0.2.2     
[16] rstudioapi_0.11     SparseM_1.78        rpart_4.1-15       
[19] Matrix_1.2-18       checkmate_2.0.0     rmarkdown_2.4      
[22] splines_4.0.2       stringr_1.4.0       foreign_0.8-79     
[25] htmlwidgets_1.5.1   munsell_0.5.0       compiler_4.0.2     
[28] xfun_0.18           pkgconfig_2.0.3     base64enc_0.1-3    
[31] htmltools_0.5.0     nnet_7.3-14         tidyselect_1.1.0   
[34] tibble_3.0.3        gridExtra_2.3       htmlTable_2.1.0    
[37] bookdown_0.20       codetools_0.2-16    rms_6.0-2          
[40] matrixStats_0.57.0  viridisLite_0.3.0   crayon_1.3.4       
[43] dplyr_1.0.2         conquer_1.0.2       withr_2.3.0        
[46] MASS_7.3-53         grid_4.0.2          nlme_3.1-149       
[49] polspline_1.1.19    jsonlite_1.7.1      gtable_0.3.0       
[52] lifecycle_0.2.0     magrittr_1.5        scales_1.1.1       
[55] rmdformats_0.3.7    stringi_1.5.3       farver_2.0.3       
[58] latticeExtra_0.6-29 ellipsis_0.3.1      generics_0.0.2     
[61] vctrs_0.3.4         sandwich_3.0-0      TH.data_1.0-10     
[64] RColorBrewer_1.1-2  tools_4.0.2         glue_1.4.2         
[67] purrr_0.3.4         crosstalk_1.1.0.1   jpeg_0.1-8.1       
[70] yaml_2.2.1          colorspace_1.4-1    cluster_2.1.0      
[73] plotly_4.9.2.1      knitr_1.30          quantreg_5.73

The reproducible research framework knitr⁸ was used.

Programming

Methods

This report was produced using high-quality open source, freely available R packages. High-level R graphics and html making functions in FE Harrell’s Hmisc package were used in the context of the R knitr package and RStudio with Rmarkdown. A new R package hreport contains functions accrualReport, dReport, exReport, eReport, and survReport using the philosophy of program-controlled generation of html and markdown text, figures, and tables. When figures were plotted in R, figure legends were automatically generated.

The entire process is best managed by creating a single .Rmd file that is executed using the knitr package in R.

Data Preparation

Variable labels are used in much of the graphical and tabular output, so it is advisable to attach label attributes to almost all variables. Variable names are used when labels are not defined. Units of measurement also appear in the output, so most continuous variables should have a units attribute. The units may contain mathematical expressions such as cm^2, which will be properly typeset in tables and plots, using superscripts, subscripts, etc. Variables that are categorical but not binary (0/1, Y/N, etc.) should have levels (value labels) defined (e.g., using the factor function) that will be attractive in the report. The upData function in the Hmisc package is useful for annotating variables with labels, units of measurement, and value labels. See Alzola and Harrell, 2006, for details about setting up analysis files.

R code that created the analysis file for this report is in the inst/tests directory of the hreport package source. For this particular application, units and some of the labels were actually obtained from separate data tables as shown in the code.
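The kind of annotation described in this section can be sketched in base R as follows. This is only an approximation of what upData does: it assumes that labels and units are stored as plain `label` and `units` attributes, and the data frame is made up for illustration.

```r
# Small made-up data frame, for illustration only
d <- data.frame(age = c(50, 61, 47), height = c(170, 182, 165),
                smoking = c(0, 1, 0))

# Value labels for a categorical variable
d$smoking <- factor(d$smoking, levels = 0:1, labels = c('No', 'Yes'))

# Labels and units stored as plain attributes (assumed convention)
attr(d$age,    'label') <- 'Age'
attr(d$age,    'units') <- 'years'
attr(d$height, 'label') <- 'Height'
attr(d$height, 'units') <- 'cm'

# Report which variables still lack a label
has_label <- sapply(d, function(x) !is.null(attr(x, 'label')))
print(has_label)
```

A check like `has_label` is one way to spot variables that would fall back to their raw names in the report output.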

Data Assumptions

Non-randomized subjects are marked by a missing date of randomization.
The treatment variable is always the same for every dataset and is defined in tx.var on sethreportOption.
For some graphics there must be either no treatment variable or exactly two treatment levels.
If there are treatments, the design is a parallel-group design.
Whenever a dataset is specified to one of the hreport functions and subjects have repeated measurements (>1 record), an id variable must be given.
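Some of these assumptions can be checked before a report is run. The sketch below is a hypothetical helper (`check_assumptions` is not part of hreport) that assumes non-randomized subjects are flagged by a missing randomization date and that the columns are named `trx`, `id`, and `rdate` as in this report's dataset.

```r
# Hypothetical helper, not part of hreport
check_assumptions <- function(d, tx.var = 'trx', id.var = 'id',
                              date.var = 'rdate') {
  randomized <- !is.na(d[[date.var]])          # missing date = not randomized
  tx_levels  <- unique(na.omit(d[[tx.var]][randomized]))
  repeated   <- any(duplicated(d[[id.var]]))   # >1 record for some subject
  list(n_randomized = length(unique(d[[id.var]][randomized])),
       n_tx_levels  = length(tx_levels),
       two_arm_ok   = length(tx_levels) %in% c(0, 2),
       needs_id     = repeated)
}

# Made-up long-format data: subject 3 was never randomized
d <- data.frame(id    = c(1, 1, 2, 2, 3),
                week  = c(0, 2, 0, 2, 0),
                trx   = c('A', 'A', 'B', 'B', NA),
                rdate = as.Date(c('1990-01-03', '1990-01-03',
                                  '1990-02-10', '1990-02-10', NA)))
res <- check_assumptions(d)
str(res)
```

Here `needs_id` signals that an id variable must be supplied because at least one subject has more than one record.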

References

Harrell, Frank E. “Hmisc: A Package of Miscellaneous R Functions,” 2020. https://hbiostat.org/R/Hmisc.

———. “rms: R Functions for Biostatistical/Epidemiologic Modeling, Testing, Estimation, Validation, Graphics, Prediction, and Typesetting by Storing Enhanced Model Design Attributes in the Fit,” 2020. https://hbiostat.org/R/rms.

R Development Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2020. http://www.R-project.org.

Xie, Yihui. Dynamic Documents with R and Knitr, Second Edition. Chapman and Hall, 2015.

This also facilitates multivariate understanding of trends and differences. For example, if one sorted countries by the fraction of subjects who died and displayed also the fraction of subjects who suffered a stroke, the extent to which stroke incidence is also sorted by country is a measure of the correlation between mortality and stroke incidence across countries.↩︎
In particular, the standard deviation is not very meaningful for asymmetric distributions, and is not robust to outliers.↩︎
Lan-DeMets monitoring bounds can be plotted using the open source R gsDesign package.↩︎
See Ellenberg, Fleming, and DeMets, Data Monitoring Committees in Clinical Trials (Wiley, 2002), pp. 73-74 for recommended components in open and closed data monitoring committee reports.↩︎
R Development Team, R.↩︎
Harrell, “Hmisc.”↩︎
Harrell, “rms.”↩︎
Xie, Dynamic Documents with R and Knitr, Second Edition.↩︎

Example Closed Meeting Data Monitoring Committee Report