flowchart LR des[describe Function] > ss[Statistical Summary] & dsh[Spike Histogram] cat[Categorical Variables] > dot[Frequency Dot Charts] con[Continuous Variables] > sh[Spike Histograms] & ebp[Extended Box Plots] lon[Longitudinal Data] > sg[Spaghetti Graphs] & rc[Representative Curves] & ol[Ordinal Transitions<br>and States] ev[Events] > mc[Multicategory Event Charts] & tl[Timelines] rel[Relationships] > cm[Graphical Correlation Matrix] & vc[Variable Clustering]
9 Descriptive Statistics
The Hmisc
describe
function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The Info
index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info
will occur when a highly imbalanced variable is binary. Clicking on Glossary
on the right will pop up a browser window with a more indepth glossary of terms used in Hmisc
package output. It links to hbiostat.org/R/glossary.html
which you can link from your reports that use Hmisc
.
Info
comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.Glossary
Gmd
in the output stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.
require(Hmisc)
options(prType='html') # have certain Hmisc functions render in html
require(qreport)
hookaddcap() # make knitr call a function at the end of each chunk
# to try to automatically add to list of figure
getHdata(stressEcho)
< stressEcho d
describe
Output
# The callout was typed manually; could have run
# makecnote(~ describe(d), wide=TRUE)
< describe(d)
w w
31 Variables 558 Observations
bhr: Basal heart rate bpm
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  69  0.999  75.29  16.57  54.0  58.0  64.0  74.0  84.0  95.3  102.0 
basebp: Basal blood pressure mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  94  0.998  135.3  23.35  104.0  110.0  120.0  133.0  150.0  162.3  170.1 
basedp: Basal Double Product bhr*basebp bpm*mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  441  1  10181  2813  6607  7200  8400  9792  11663  13610  14770 
pkhr: Peak heart rate mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  105  1  120.6  25.36  81.85  90.70  106.25  122.00  135.00  147.00  155.15 
sbp: Systolic blood pressure mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  142  1  146.9  40.72  96  102  120  141  170  200  210 
dp: Double product pkhr*sbp bpm*mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  508  1  17634  5765  10256  11341  14033  17060  20644  24536  26637 
dose: Dose of dobutamine given mg
n  missing  distinct  Info  Mean  Gmd 

558  0  7  0.84  33.75  8.334 
Value 10.0 14.8 19.9 25.0 29.8 34.9 40.0 Frequency 2 28 47 56 64 61 300 Proportion 0.004 0.050 0.084 0.100 0.115 0.109 0.538For the frequency table, variable is rounded to the nearest 0.3
maxhr: Maximum heart rate bpm
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  103  1  119.4  24.64  82.0  91.0  104.2  120.0  133.0  146.0  154.1 
pctMphr: Percent maximum predicted heart rate achieved %
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  78  0.999  78.57  16.86  53  60  69  78  88  97  104 
mbp: Maximum blood pressure mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  132  0.999  156  35.03  110.0  120.0  133.2  150.0  175.8  200.0  211.1 
dpmaxdo: Double product on max dobutamine dose bpm*mmHg
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  484  1  18550  5385  11346  12865  15260  18118  21239  24893  27477 
dobdose: Dobutamine dose at max double product mg
n  missing  distinct  Info  Mean  Gmd 

558  0  8  0.941  30.24  10.55 
Value 5.00 9.90 14.80 19.70 24.95 29.85 34.75 40.00 Frequency 7 7 55 73 71 78 62 205 Proportion 0.013 0.013 0.099 0.131 0.127 0.140 0.111 0.367For the frequency table, variable is rounded to the nearest 0.35
age: Age years
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  62  0.999  67.34  13.41  46.85  51.00  60.00  69.00  75.00  82.00  85.00 
gender
n  missing  distinct 

558  0  2 
Value male female Frequency 220 338 Proportion 0.394 0.606
baseEF: Baseline cardiac ejection fraction %
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  54  0.994  55.6  10.71  32  40  52  57  62  65  66 
dobEF: Ejection fraction on dobutamine %
n  missing  distinct  Info  Mean  Gmd  .05  .10  .25  .50  .75  .90  .95 

558  0  60  0.992  65.24  12.38  40.0  49.7  62.0  67.0  73.0  76.0  80.0 
chestpain: Chest pain
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.64  172  0.3082  0.4272 
restwma: Resting wall motion abnormality on echocardiogram
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.745  257  0.4606  0.4978 
posSE: Positive stress echocardiogram
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.553  136  0.2437  0.3693 
newMI: New myocardial infarction
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.143  28  0.05018  0.09549 
newPTCA: Recent angioplasty
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.138  27  0.04839  0.09226 
newCABG: Recent bypass surgery
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.167  33  0.05914  0.1115 
death
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.123  24  0.04301  0.08247 
hxofHT: History of hypertension
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.625  393  0.7043  0.4173 
hxofDM: History of diabetes
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.699  206  0.3692  0.4666 
hxofCig: History of smoking
n  missing  distinct 

558  0  3 
Value heavy moderate nonsmoker Frequency 122 138 298 Proportion 0.219 0.247 0.534
hxofMI: History of myocardial infarction
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.599  154  0.276  0.4004 
hxofPTCA: History of angioplasty
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.204  41  0.07348  0.1364 
hxofCABG: History of coronary artery bypass surgery
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.399  88  0.1577  0.2661 
any.event: Death, newMI, newPTCA, or newCABG
n  missing  distinct  Info  Sum  Mean  Gmd 

558  0  2  0.402  89  0.1595  0.2686 
ecg: Baseline electrocardiogram diagnosis
n  missing  distinct 

558  0  3 
Value normal equivocal MI Frequency 311 176 71 Proportion 0.557 0.315 0.127
# To create a separate browser window:
cat(html(w), file='desc.html')
browseURL('desc.html', browser='firefox newwindow')
Better, whether using RStudio
or not:
htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)
# Use htmlViewx to use an external browser window (see above)
Or just type contents(d)
with options(prType='html')
in effect. This will launch a browser window if you are not in RStudio
.
Nicer output from describe
is obtained by separating categorical and continuous variables. The special print
method used here, with options(prType='html')
in effect, creates interactive sparklines so that variable values and frequencies can be looked up in spike histograms without using plotly
graphics but using more basic javascript
. Hover over the lowest and highest points on the histograms to see the 5 most extreme data values. Note that to initialize javascript
jQuery
dependencies we must include sparkline::sparkline(0)
somewhere in the report before creating the first sparkline output. The qreport
maketabs
function is helpful here.
::sparkline(0) sparkline
maketabs(print(w, 'both'), wide=TRUE, initblank=TRUE)
d Descriptives 

13 Continous Variables of 31 Variables, 558 Observations  
Variable  Label  Units  n  Missing  Distinct  Info  Mean  Gini Δ  Quantiles .05 .10 .25 .50 .75 .90 .95 


bhr  Basal heart rate  bpm  558  0  69  0.999  75.29  16.57  54.0 58.0 64.0 74.0 84.0 95.3 102.0  
basebp  Basal blood pressure  mmHg  558  0  94  0.998  135.3  23.35  104.0 110.0 120.0 133.0 150.0 162.3 170.1  
basedp  Basal Double Product bhr*basebp  bpm*mmHg  558  0  441  1.000  10181  2813  6607 7200 8400 9792 11663 13610 14770  
pkhr  Peak heart rate  mmHg  558  0  105  1.000  120.6  25.36  81.85 90.70 106.25 122.00 135.00 147.00 155.15  
sbp  Systolic blood pressure  mmHg  558  0  142  1.000  146.9  40.72  96 102 120 141 170 200 210  
dp  Double product pkhr*sbp  bpm*mmHg  558  0  508  1.000  17634  5765  10256 11341 14033 17060 20644 24536 26637  
maxhr  Maximum heart rate  bpm  558  0  103  1.000  119.4  24.64  82.0 91.0 104.2 120.0 133.0 146.0 154.1  
pctMphr  Percent maximum predicted heart rate achieved  %  558  0  78  0.999  78.57  16.86  53 60 69 78 88 97 104  
mbp  Maximum blood pressure  mmHg  558  0  132  0.999  156  35.03  110.0 120.0 133.2 150.0 175.8 200.0 211.1  
dpmaxdo  Double product on max dobutamine dose  bpm*mmHg  558  0  484  1.000  18550  5385  11346 12865 15260 18118 21239 24893 27477  
age  Age  years  558  0  62  0.999  67.34  13.41  46.85 51.00 60.00 69.00 75.00 82.00 85.00  
baseEF  Baseline cardiac ejection fraction  %  558  0  54  0.994  55.6  10.71  32 40 52 57 62 65 66  
dobEF  Ejection fraction on dobutamine  %  558  0  60  0.992  65.24  12.38  40.0 49.7 62.0 67.0 73.0 76.0 80.0 
d Descriptives 

18 Categorical Variables of 31 Variables, 558 Observations  
Variable  Label  Units  n  Missing  Distinct  Info  Sum  Mean  Gini Δ  

dose  Dose of dobutamine given  mg  558  0  7  0.840  33.75  8.334  
dobdose  Dobutamine dose at max double product  mg  558  0  8  0.941  30.24  10.55  
gender  558  0  2  
chestpain  Chest pain  558  0  2  0.640  172  0.3082  0.4272  
restwma  Resting wall motion abnormality on echocardiogram  558  0  2  0.745  257  0.4606  0.4978  
posSE  Positive stress echocardiogram  558  0  2  0.553  136  0.2437  0.3693  
newMI  New myocardial infarction  558  0  2  0.143  28  0.05018  0.09549  
newPTCA  Recent angioplasty  558  0  2  0.138  27  0.04839  0.09226  
newCABG  Recent bypass surgery  558  0  2  0.167  33  0.05914  0.1115  
death  558  0  2  0.123  24  0.04301  0.08247  
hxofHT  History of hypertension  558  0  2  0.625  393  0.7043  0.4173  
hxofDM  History of diabetes  558  0  2  0.699  206  0.3692  0.4666  
hxofCig  History of smoking  558  0  3  
hxofMI  History of myocardial infarction  558  0  2  0.599  154  0.276  0.4004  
hxofPTCA  History of angioplasty  558  0  2  0.204  41  0.07348  0.1364  
hxofCABG  History of coronary artery bypass surgery  558  0  2  0.399  88  0.1577  0.2661  
any.event  Death, newMI, newPTCA, or newCABG  558  0  2  0.402  89  0.1595  0.2686  
ecg  Baseline electrocardiogram diagnosis  558  0  3 
There is also a plot
method for describe
output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2
to produce static graphics. The result can be fed directly into maketabs
described earlier. results='asis'
must appear in the chunk header.
< 'Regular `plot(describe)` output'
cap maketabs(plot(w, bvspace=2.5), basecap=cap, cap=1)
By specifying grType
option you can instead get plotly
graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.
options(grType='plotly')
< '`plotly` `plot(describe)` output'
cap maketabs(plot(w, bvspace=2.5), wide=TRUE, cap=1, basecap=cap)
See this for other Hmisc
functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM
function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).
require(data.table)
setDT(d) # turn d into a data table
# tables() with no arguments will give a concise summary of all active data tables
< d
w := factor(hxofMI, 0 : 1, c('No history of MI', 'History of MI'))]
w[, hxofMI < setdiff(names(d), 'hxofMI')
vars < as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))
form print(form)
bhr + basebp + basedp + pkhr + sbp + dp + dose + maxhr + pctMphr + mbp + dpmaxdo + dobdose + age + gender + baseEF + dobEF + chestpain + restwma + posSE + newMI + newPTCA + newCABG + death + hxofHT + hxofDM + hxofCig + hxofPTCA + hxofCABG + any.event + ecg ~ hxofMI
< summaryM(form, data=d, test=TRUE)
s # Note: there is a problem with the width of the categorical variable
# plot. Neither fig.size() nor options(plotlyauto=FALSE) fixed it.
maketabs(
` ` ~ ` `, # empty tab
`Table 1` ~ print(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),
`Categorical Variable Plot` ~ plot(s, which='categorical', vars=1 : 4) +
caption('`summaryM` plots') +
fig.size(width=6),
`Continuous Variable Plot` ~ plot(s, which='continuous', vars=1 : 4),
wide=TRUE)
Descriptive Statistics (N=558).  

No history of MI N=404 
History of MI N=154 
Test Statistic 

Basal heart rate
bpm

65 74 85  63 72 84  F_{1 556}=1.41, P=0.235^{1} 
Basal blood pressure
mmHg

120 134 150  120 130 150  F_{1 556}=1.39, P=0.238^{1} 
Basal Double Product bhr*basebp
bpm*mmHg

8514 9874 11766  8026 9548 11297  F_{1 556}=3.32, P=0.069^{1} 
Peak heart rate
mmHg

107 123 136  104 120 132  F_{1 556}=2.35, P=0.126^{1} 
Systolic blood pressure
mmHg

124 146 174  115 134 158  F_{1 556}=12.1, P<0.001^{1} 
Double product pkhr*sbp
bpm*mmHg

14520 17783 21116  13198 15539 18885  F_{1 556}=15, P<0.001^{1} 
Dose of dobutamine given
mg
: 10 
0.00 ^{2}⁄_{404}  0.00 ^{0}⁄_{154}  χ^{2}_{6}=8.77, P=0.187^{2} 
15  0.05 ^{21}⁄_{404}  0.05 ^{7}⁄_{154}  
20  0.10 ^{40}⁄_{404}  0.05 ^{7}⁄_{154}  
25  0.11 ^{45}⁄_{404}  0.07 ^{11}⁄_{154}  
30  0.11 ^{43}⁄_{404}  0.14 ^{21}⁄_{154}  
35  0.11 ^{45}⁄_{404}  0.10 ^{16}⁄_{154}  
40  0.51 ^{208}⁄_{404}  0.60 ^{92}⁄_{154}  
Maximum heart rate
bpm

107 122 134  102 118 130  F_{1 556}=4.05, P=0.045^{1} 
Percent maximum predicted heart rate achieved
%

69.0 78.0 89.0  70.0 77.0 87.5  F_{1 556}=0.5, P=0.479^{1} 
Maximum blood pressure
mmHg

138 154 180  130 142 162  F_{1 556}=13, P<0.001^{1} 
Double product on max dobutamine dose
bpm*mmHg

15654 18666 21664  14489 16785 19680  F_{1 556}=16.1, P<0.001^{1} 
Dobutamine dose at max double product
mg
: 5 
0.01 ^{4}⁄_{404}  0.02 ^{3}⁄_{154}  χ^{2}_{7}=8.5, P=0.29^{2} 
10  0.01 ^{6}⁄_{404}  0.01 ^{1}⁄_{154}  
15  0.11 ^{43}⁄_{404}  0.08 ^{12}⁄_{154}  
20  0.14 ^{58}⁄_{404}  0.10 ^{15}⁄_{154}  
25  0.13 ^{54}⁄_{404}  0.11 ^{17}⁄_{154}  
30  0.13 ^{51}⁄_{404}  0.18 ^{27}⁄_{154}  
35  0.10 ^{40}⁄_{404}  0.14 ^{22}⁄_{154}  
40  0.37 ^{148}⁄_{404}  0.37 ^{57}⁄_{154}  
Age
years

59.0 68.0 75.0  63.2 71.0 76.8  F_{1 556}=9.75, P=0.002^{1} 
gender : female  0.68 ^{273}⁄_{404}  0.42 ^{65}⁄_{154}  χ^{2}_{1}=30, P<0.001^{2} 
Baseline cardiac ejection fraction
%

55 59 63  40 54 60  F_{1 556}=56.4, P<0.001^{1} 
Ejection fraction on dobutamine
%

65.0 70.0 74.2  50.0 64.5 70.0  F_{1 556}=50.3, P<0.001^{1} 
Chest pain  0.29 ^{119}⁄_{404}  0.34 ^{53}⁄_{154}  χ^{2}_{1}=1.29, P=0.257^{2} 
Resting wall motion abnormality on echocardiogram  0.57 ^{230}⁄_{404}  0.18 ^{27}⁄_{154}  χ^{2}_{1}=69.7, P<0.001^{2} 
Positive stress echocardiogram  0.21 ^{86}⁄_{404}  0.32 ^{50}⁄_{154}  χ^{2}_{1}=7.56, P=0.006^{2} 
New myocardial infarction  0.03 ^{14}⁄_{404}  0.09 ^{14}⁄_{154}  χ^{2}_{1}=7.4, P=0.007^{2} 
Recent angioplasty  0.02 ^{10}⁄_{404}  0.11 ^{17}⁄_{154}  χ^{2}_{1}=17.8, P<0.001^{2} 
Recent bypass surgery  0.05 ^{21}⁄_{404}  0.08 ^{12}⁄_{154}  χ^{2}_{1}=1.35, P=0.246^{2} 
death  0.04 ^{15}⁄_{404}  0.06 ^{9}⁄_{154}  χ^{2}_{1}=1.23, P=0.267^{2} 
History of hypertension  0.69 ^{280}⁄_{404}  0.73 ^{113}⁄_{154}  χ^{2}_{1}=0.89, P=0.346^{2} 
History of diabetes  0.36 ^{147}⁄_{404}  0.38 ^{59}⁄_{154}  χ^{2}_{1}=0.18, P=0.674^{2} 
History of smoking : heavy  0.21 ^{83}⁄_{404}  0.25 ^{39}⁄_{154}  χ^{2}_{2}=3.16, P=0.206^{2} 
moderate  0.24 ^{96}⁄_{404}  0.27 ^{42}⁄_{154}  
nonsmoker  0.56 ^{225}⁄_{404}  0.47 ^{73}⁄_{154}  
History of angioplasty  0.04 ^{15}⁄_{404}  0.17 ^{26}⁄_{154}  χ^{2}_{1}=28.4, P<0.001^{2} 
History of coronary artery bypass surgery  0.08 ^{34}⁄_{404}  0.35 ^{54}⁄_{154}  χ^{2}_{1}=59.6, P<0.001^{2} 
Death, newMI, newPTCA, or newCABG  0.12 ^{48}⁄_{404}  0.27 ^{41}⁄_{154}  χ^{2}_{1}=18.1, P<0.001^{2} 
Baseline electrocardiogram diagnosis : normal  0.59 ^{240}⁄_{404}  0.46 ^{71}⁄_{154}  χ^{2}_{2}=8, P=0.018^{2} 
equivocal  0.29 ^{117}⁄_{404}  0.38 ^{59}⁄_{154}  
MI  0.12 ^{47}⁄_{404}  0.16 ^{24}⁄_{154}  
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. Tests used: ^{1}Wilcoxon test; ^{2}Pearson test . 
Semiinteractive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.
histboxp(x=maxhr, group=ecg, bins=200)] d[,
9.1 Longitudinal Continuous Y
For a continuous response variable measured longitudinally, two of the ways to display the data are
 “spaghetti plots” showing all data for all individual subjects, if the number of subjects is not very large
 finding clusters of individual subject curve characteristics and plotting a random sample of raw data curves within each cluster if the sample size is large
The Hmisc
package curveRep
function facilitates the latter approach using representative curves. It considers persubject sample sizes and measurement time gaps as these will not only limit how we look at the data but may be informative, e.g., subjects who are failing may start to get more frequent measurements.
To demonstrate we simulate data in the following way.
 Simulate 200 subjects (curves) with percurve sample sizes ranging from 1 to 10
 Make curves with oddnumbered IDs have a measurement time distribution that is random uniform [0,1] and those with evennumbered IDs have a time distribution that is half as wide but still centered at 0.5. Shift y values higher with increasing IDs
 Make 1/3 of subjects have a flat trajectory, 1/3 linear and not flat, and 1/3 quadratic
Request curveRep
to cluster the 200 curves on the following characteristics:
kn=3
sample size groups (curveRep
actually used one more than this)kxdist=2
time point distribution groups based here on the earliest and latest measurement times and the longest gap between any two measurements within a subjectk=4
trajectory clusters. Trajectories are determined byloess
smoothing each subject’s curve and linearly interpolating the estimates to the same evenlyspaced grid of time points, then clustering on these 10 yestimates. This captures intercepts, slopes, and shapes.
All subjects within each cluster are shown; we didn’t need to take random samples for 200 subjects. Results for the 4 percurve sample size groups are placed in separate tabs.
set.seed(1)
< 200
N < sample(1:10, N, TRUE)
nc < rep(1:N, nc)
id < y < id
x # Solve for coefficients of quadratic function such that it agrees with
# the linear function at x=0.25, 0.75 and is lower by delta at x=0.5
< 3
delta < list(c(0, 0, 0), c(0, 10, 0), c(delta, 10, 2 * delta / (2 * 0.25^2)))
cof
for(i in 1 : N) {
==i] < if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
x[id< sample(1:3, 1)
shape < x[id == i]  0.5
xc < cof[[shape]]
k == i] < i/20 + k[1] + k[2] * xc + k[3] * xc ^ 2 +
y[id runif(nc[i], 2.5, 2.5)
}require(cluster)
< curveRep(x, y, id, kn=3, kxdist=2, k=4, p=10)
w < vector('list', 4)
gg < rep('', 4)
nam for(i in 1 : 4) {
< plot(w, i, method='data') # method='data' available in Hmisc 4.71
z < transform(z,
z distribution = paste('Time Distribution:', distribution),
cluster = paste('Trajectory Cluster:', cluster))
<
gg[[i]] if(i == 1)
ggplot(z, aes(x, y, color=curve)) + geom_point() +
facet_grid(distribution ~ cluster) +
theme(legend.position='none')
else
ggplot(z, aes(x, y, color=curve)) + geom_line() +
facet_grid(distribution ~ cluster) +
theme(legend.position='none')
< z$ninterval[1]
nam[i]
}names(gg) < nam
maketabs(gg, basecap='Representative curves determined by `curveRep` stratified by persubject sample size ranges', cap=1)
9.2 Longitudinal Ordinal Y
Continuous longitudinal Y may be analyzed flexibly using semiparametric models while being described using representative curves as just discussed. Now suppose that Y is discrete. There are three primary ways of describing such data graphically.
 Show event occurrences/trajectories for a random sample of subjects (
Hmisc::multEventChart
)  Show all transition proportions if time is discrete (
Hmisc::propsTrans
)  Show all state occupancy proportions if time is discrete (
Hmisc::propsPO
)
multEventChart
Starting with a multiple event chart, simulate data on 5 patients, then display all the raw data.
< data.frame(pt=1, day=0:3,
pt1 status=.q(well, well, sick, 'very sick'))
< data.frame(pt=2, day=c(1,2,4,6),
pt2 status=.q(sick, 'very sick', coma, death))
< data.frame(pt=3, day=1:5,
pt3 status=.q(sick, 'very sick', sick, 'very sick', 'discharged'))
< data.frame(pt=4, day=c(1:4, 10),
pt4 status=.q(well, sick, 'very sick', well, discharged))
< rbind(pt1, pt2, pt3, pt4)
d < upData(d,
d status = factor(status, .q(discharged, well, sick,
'very sick', coma, death)),
labels = c(day = 'Day'), print=FALSE)
kabl(pt1, pt2, pt3, pt4)




multEventChart(status ~ day + pt, data=d,
absorb=.q(death, discharged),
colorTitle='Status', sortbylast=TRUE) +
theme_classic() + theme(legend.position='bottom')
For an example of displaying transition proportions, whereby all the outcome information in the raw data is shown, simulate some random data. The result is a plotly
graphic over which you hover the pointer to see details. The size of the bubbles is proportional to the proportion in that transition.
set.seed(1)
< expand.grid(id=1:30, time=1:7)
d setDT(d) # convert to data.table
:= sample(c('female', 'male'), .N, replace=TRUE)]
d[, sex := sample(LETTERS[1:4], .N, replace=TRUE)]
d[, state ggplotlyr(propsTrans(state ~ time + id, data=d))
The final display doesn’t capture the withinsubject correlation as done with the transition proportions, but is the most familiar display for longitudinal ordinal data as it shows proportions in the current states, which are cumulative incidence estimates for absorbing states.
propsPO
to work properly, even though the real data file would terminate followup at an absorbing event.ggplotlyr(propsPO(state ~ time + sex, data=d))
9.3 Adverse Event Chart
When there is a potentially large number of event types, such as adverse events (AEs) in a clinical trial, and the event timing is not considered, a dot chart is an excellent way to present the proportion of subjects suffering each type of AE. The AEs can be sorted in descending order by the difference in proportions between treatments, and plotly
hover text can display more details. Halfwidth confidence intervals are used (see Section 15.6). An AE chart is easily produced using the aePlot
function in qreport
. aePlot
expects the dataset to have one record per subject per AE, so the dataset itself does not define the proper denominator and this must be specified by the user (see denom
below). The color coded needles in the right margin are guideposts to which denominators are being used in the analysis (details are here).
getHdata(aeTestData) # original data source: HH package
# One record per subject per adverse event
# For this example, the denominators for the two treatments in the
# popup needles will be incorrect because the dataset did not have
# subject IDs.
< aeTestData
ae # qreport requires us to define official clinical trial counts and
# name of treatment variable
< c(enrolled = 1000,
denom randomized = 400,
a=212, b=188)
setqreportOption(tx.var='treat', denom=denom)
aePlot(event ~ treat, data=ae, minincidence=.05, size='wide')
Category  N  Used 

enrolled  1000  400 
randomized  400  400 
a  212  212 
b  188  188 
9.4 Continuous Event Times
For timetoevent data with possibly multiple types of events, an event chart is a good way to show the raw outcome data for a sample of up to perhaps 40 subjects. The Hmisc
package offers the event.chart
function written by Jack Lee, Kenneth Hess, and Joel Dubin. Here is an example they provided. Patients are sorted by diagnosis date.
getHdata(cdcaids)
event.chart(cdcaids,
subset.c=.q(infedate, diagdate, dethdate, censdate),
x.lab = 'Observation Dates',
y.lab='Patients',
titl='AIDS Data Calendar Event Chart',
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
legend.plot=TRUE, legend.location='i', legend.cex=0.8,
legend.point.text=.q(transfusion,'AIDS diagnosis',death,censored),
legend.point.at = list(c(7210, 8100), c(35, 27)))
9.5 Describing Variable Interrelationships
The most basic way to examine interrelationships among variables is to graphically depict a correlation matrix. Below is an example on support
using the Spearman’s \(\rho\) rank correlation coefficient. Another good descriptive analysis to help understand relationships among variables and redundancies/collinearities is variable clustering. Here one clusters on variables instead of observations, and instead of a distance matrix we have a similarly matrix. A good default similarity matrix is based on the square of \(\rho\). In order to restrict ourselves to unsupervised learning, also called data reduction, we restrict attention to nonoutcome variables in both displays. The vClus
function in qreport
runs the dataset, after excluding some variables, through the Hmisc
dataframeReduce
function to eliminate variables that are missing more than 0.4 of the time and to ignore character or factor variables having more than 10 levels. Binary variables having prevalence < 0.05 are dropped, and categorical variables having < 0.05 of their nonmissing values in a category will have such lowfrequency categories combined into an “other” category for purposes of computing all the correlation coefficients.
fracmiss, maxlevels, minprev
arguments as the default values are reasonable.getHdata(support)
< .q(slos, charges, totcst, totmcst, avtisst,
outcomes
d.time, death, hospdead, sfdm2)vClus(support, exclude=outcomes, corrmatrix=TRUE,
fracmiss=0.4, maxlevels=10, minprev=0.05,
label='figvarclus')
Variable  Reason 

dzgroup  grouped categories 
race  grouped categories 
glucose  fraction missing>0.4 
bun  fraction missing>0.4 
urine  fraction missing>0.4 
adlp  fraction missing>0.4 
The most strongly related variables are competing indicator variables for categories from the same variable, scoma
vs. dzgroup
“Coma”, and dzgroup
“Cancer” vs. dzclass
“Cancer”. The first type of relationship generates a strong negative correlation because if you’re in one category you can’t be in the other.