The Hmiscdescribe function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The Info index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info will occur when a highly imbalanced variable is binary. Clicking on Glossary on the right will pop up a browser window with a more in-depth glossary of terms used in Hmisc package output. It links to hbiostat.org/R/glossary.html which you can link from your reports that use Hmisc.

Info comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead. Glossary

Gmd in the output stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.

require(Hmisc)getRs('reptools.r')hookaddcap() # make knitr call a function at the end of each chunk# to try to automatically add to list of figuregetHdata(stressEcho)d <- stressEcho

describe Output

# The callout was typed manually; could have run# makecnote(~ html(describe(d)), wide=TRUE)w <-describe(d)html(w)

restwma: Resting wall motion abnormality on echocardiogram

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.745

257

0.4606

0.4978

posSE: Positive stress echocardiogram

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.553

136

0.2437

0.3693

newMI: New myocardial infarction

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.143

28

0.05018

0.09549

newPTCA: Recent angioplasty

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.138

27

0.04839

0.09226

newCABG: Recent bypass surgery

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.167

33

0.05914

0.1115

death

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.123

24

0.04301

0.08247

hxofHT: History of hypertension

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.625

393

0.7043

0.4173

hxofDM: History of diabetes

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.699

206

0.3692

0.4666

hxofCig: History of smoking

n

missing

distinct

558

0

3

Value heavy moderate non-smoker
Frequency 122 138 298
Proportion 0.219 0.247 0.534

hxofMI: History of myocardial infarction

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.599

154

0.276

0.4004

hxofPTCA: History of angioplasty

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.204

41

0.07348

0.1364

hxofCABG: History of coronary artery bypass surgery

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.399

88

0.1577

0.2661

any.event: Death, newMI, newPTCA, or newCABG

n

missing

distinct

Info

Sum

Mean

Gmd

558

0

2

0.402

89

0.1595

0.2686

ecg: Baseline electrocardiogram diagnosis

n

missing

distinct

558

0

3

Value normal equivocal MI
Frequency 311 176 71
Proportion 0.557 0.315 0.127

# To create a separate browser window:cat(html(w), file='desc.html')browseURL('desc.html', browser='firefox -new-window')

Better, whether using RStudio or not:

htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)# Use htmlViewx to use an external browser window (see above)

There is also a plot method for describe output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2 to produce static graphics. The result can be fed directly into maketabs described earlier. results='asis' must appear in the chunk header.

cap <-'Regular `plot(describe)` output'maketabs(plot(w, bvspace=2.5), basecap=cap, cap=1)

By specifying grType option you can instead get plotly graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.

See this for other Hmisc functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).

require(data.table)setDT(d) # turn d into a data table# tables() with no arguments will give a concise summary of all active data tablesw <- dw[, hxofMI :=factor(hxofMI, 0:1, c('No history of MI', 'History of MI'))]vars <-setdiff(names(d), 'hxofMI')form <-as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))print(form)

s <-summaryM(form, data=d, test=TRUE)# Note: there is a problem with the width of the categorical variable# plot. Neither fig.size() nor options(plotlyauto=FALSE) fixed it.source('~/R/rscripts/reptools.r')maketabs(``~``, # empty tab`Table 1`~html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),`Categorical Variable Plot`~plot(s, which='categorical', vars=1:4) +caption('`summaryM` plots') +fig.size(width=6),`Continuous Variable Plot`~plot(s, which='continuous', vars=1:4),wide=TRUE)

abc represent the lower quartile a, the median b, and the upper quartile c for continuous variables. Tests used: ^{1}Wilcoxon test; ^{2}Pearson test .

Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.

d[, histboxp(x=maxhr, group=ecg, bins=200)]

9.1 Longitudinal Continuous Y

For a continuous response variable measured longitudinally, two of the ways to display the data are

“spaghetti plots” showing all data for all individual subjects, if the number of subjects is not very large

finding clusters of individual subject curve characteristics and plotting a random sample of raw data curves within each cluster if the sample size is large

The Hmisc package curveRep function facilitates the latter approach using representative curves. It considers per-subject sample sizes and measurement time gaps as these will not only limit how we look at the data but may be informative, e.g., subjects who are failing may start to get more frequent measurements.

To demonstrate we simulate data in the following way.

Simulate 200 subjects (curves) with per-curve sample sizes ranging from 1 to 10

Make curves with odd-numbered IDs have a measurement time distribution that is random uniform [0,1] and those with even-numbered IDs have a time distribution that is half as wide but still centered at 0.5. Shift y values higher with increasing IDs

Make 1/3 of subjects have a flat trajectory, 1/3 linear and not flat, and 1/3 quadratic

Request curveRep to cluster the 200 curves on the following characteristics:

kn=3 sample size groups (curveRep actually used one more than this)

kxdist=2 time point distribution groups based here on the earliest and latest measurement times and the longest gap between any two measurements within a subject

k=4 trajectory clusters. Trajectories are determined by loess-smoothing each subject’s curve and linearly interpolating the estimates to the same evenly-spaced grid of time points, then clustering on these 10 y-estimates. This captures intercepts, slopes, and shapes.

All subjects within each cluster are shown; we didn’t need to take random samples for 200 subjects. Results for the 4 per-curve sample size groups are placed in separate tabs.

set.seed(1)N <-200nc <-sample(1:10, N, TRUE)id <-rep(1:N, nc)x <- y <- id# Solve for coefficients of quadratic function such that it agrees with # the linear function at x=0.25, 0.75 and is lower by -delta at x=0.5delta <--3cof <-list(c(0, 0, 0), c(0, 10, 0), c(delta, 10, -2* delta / (2*0.25^2)))for(i in1: N) { x[id==i] <-if(i %%2) runif(nc[i]) elserunif(nc[i], c(.25, .75)) shape <-sample(1:3, 1) xc <- x[id == i] -0.5 k <- cof[[shape]] y[id == i] <- i/20+ k[1] + k[2] * xc + k[3] * xc ^2+runif(nc[i], -2.5, 2.5)}require(cluster)w <-curveRep(x, y, id, kn=3, kxdist=2, k=4, p=10)gg <-vector('list', 4)nam <-rep('', 4)for(i in1:4) { z <-plot(w, i, method='data') # method='data' available in Hmisc 4.7-1 z <-transform(z,distribution =paste('Time Distribution:', distribution),cluster =paste('Trajectory Cluster:', cluster)) gg[[i]] <-if(i ==1)ggplot(z, aes(x, y, color=curve)) +geom_point() +facet_grid(distribution ~ cluster) +theme(legend.position='none')elseggplot(z, aes(x, y, color=curve)) +geom_line() +facet_grid(distribution ~ cluster) +theme(legend.position='none') nam[i] <- z$ninterval[1]}names(gg) <- nammaketabs(gg, basecap='Representative curves determined by `curveRep` stratified by per-subject sample size ranges', cap=1)

Continuous longitudinal Y may be analyzed flexibly using semiparametric models while being described using representative curves as just discussed. Now suppose that Y is discrete. There are three primary ways of describing such data graphically.

Show event occurrences/trajectories for a random sample of subjects (Hmisc::multEventChart)

Show all transition proportions if time is discrete (Hmisc::propsTrans)

Show all state occupancy proportions if time is discrete (Hmisc::propsPO)

Thanks to Lucy D’Agostino McGowan for writing most of the code for multEventChart

Starting with a multiple event chart, simulate data on 5 patients, then display all the raw data.

multEventChart(status ~ day + pt, data=d,absorb=.q(death, discharged),colorTitle='Status', sortbylast=TRUE) +theme_classic() +theme(legend.position='bottom')

For an example of displaying transition proportions, whereby all the outcome information in the raw data is shown, simulate some random data. The result is a plotly graphic over which you hover the pointer to see details. The size of the bubbles is proportional to the proportion in that transition.

set.seed(1)d <-expand.grid(id=1:30, time=1:7)setDT(d) # convert to data.tabled[, sex :=sample(c('female', 'male'), .N, replace=TRUE)]d[, state :=sample(LETTERS[1:4], .N, replace=TRUE)]ggplotlyr(propsTrans(state ~ time + id, data=d))

The final display doesn’t capture the within-subject correlation as done with the transition proportions, but is the most familiar display for longitudinal ordinal data as it shows proportions in the current states, which are cumulative incidence estimates for absorbing states.

Should absorbing states have occurred in the data you would need to carry these forward to the end of follow-up for propsPO to work properly, even though the real data file would terminate follow-up at an absorbing event.

ggplotlyr(propsPO(state ~ time + sex, data=d))

9.3 Adverse Event Chart

When there is a potentially large number of event types, such as adverse events (AEs) in a clinical trial, and the event timing is not considered, a dot chart is an excellent way to present the proportion of subjects suffering each type of AE. The AEs can be sorted in descending order by the difference in proportions between treatments, and plotly hover text can display more details. Half-width confidence intervals are used (see Section 15.6). An AE chart is easily produced using the aePlot function in the qreport repository on github. aePlot expects the dataset to have one record per subject per AE, so the dataset itself does not define the proper denominator and this must be specified by the user (see denom below). The color coded needles in the right margin are guideposts to which denominators are being used in the analysis (details are here).

getRs('misc.r', grepo='qreport')getRs('aePlot.r', grepo='qreport')getHdata(aeTestData) # original data source: HH package# One record per subject per adverse event# For this example, the denominators for the two treatments in the# pop-up needles will be incorrect because the dataset did not have# subject IDs.ae <- aeTestData# qreport requires us to define official clinical trial counts and # name of treatment variabledenom <-c(enrolled =1000,randomized =400,a=212, b=188)setqreportOption(tx.var='treat', denom=denom)aePlot(event ~ treat, data=ae, minincidence=.05, size='wide')

Category

N

Used

enrolled

1000

400

randomized

400

400

a

212

212

b

188

188

9.4 Continuous Event Times

For time-to-event data with possibly multiple types of events, an event chart is a good way to show the raw outcome data for a sample of up to perhaps 40 subjects. The Hmisc package offers the event.chart function written by Jack Lee, Kenneth Hess, and Joel Dubin. Here is an example they provided. Patients are sorted by diagnosis date.

The most basic way to examine interrelationships among variables is to graphically depict a correlation matrix. Below is an example on support using the Spearman’s \(\rho\) rank correlation coefficient. Another good descriptive analysis to help understand relationships among variables and redundancies/collinearities is variable clustering. Here one clusters on variables instead of observations, and instead of a distance matrix we have a similarly matrix. A good default similarity matrix is based on the square of \(\rho\). In order to restrict ourselves to unsupervised learning, also called data reduction, we restrict attention to non-outcome variables in both displays. The vClus function in reptools runs the dataset, after excluding some variables, through the HmiscdataframeReduce function to eliminate variables that are missing more than 0.4 of the time and to ignore character or factor variables having more than 10 levels. Binary variables having prevalence < 0.05 are dropped, and categorical variables having < 0.05 of their non-missing values in a category will have such low-frequency categories combined into an “other” category for purposes of computing all the correlation coefficients.

Normally one would omit the fracmiss, maxlevels, minprev arguments as the default values are reasonable.

The most strongly related variables are competing indicator variables for categories from the same variable, scoma vs. dzgroup “Coma”, and dzgroup “Cancer” vs. dzclass “Cancer”. The first type of relationship generates a strong negative correlation because if you’re in one category you can’t be in the other.

Source Code

# Descriptive Statistics {#sec-descript}```{mermaid}flowchart LRdes[describe Function] --> ss[Statistical Summary] & dsh[Spike Histogram]cat[Categorical Variables] --> dot[Frequency Dot Charts]con[Continuous Variables] --> sh[Spike Histograms] & ebp[Extended Box Plots]lon[Longitudinal Data] --> sg[Spaghetti Graphs] & rc[Representative Curves] & ol[Ordinal Transitions<br>and States]ev[Events] --> mc[Multi-category Event Charts] & tl[Timelines]rel[Relationships] --> cm[Graphical Correlation Matrix] & vc[Variable Clustering]```The `Hmisc``describe` function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The `Info` index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of `Info` will occur when a highly imbalanced variable is binary. Clicking on `Glossary` on the right will pop up a browser window with a more in-depth glossary of terms used in `Hmisc` package output. It links to `hbiostat.org/R/glossary.html` which you can link from your reports that use `Hmisc`.[`Info` comes from the [approximate formula](http://hbiostat.org/bib/r2.html) for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.<br><a href="https://hbiostat.org/R/glossary.html" target="popup" onclick="window.open('https://hbiostat.org/R/glossary.html#describe', 'popup', 'width=450,height=600'); return false;"><small>Glossary</small></a>]{.aside}`Gmd` in the output stands for Gini's mean difference---the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.```{r setup}require(Hmisc)getRs('reptools.r')hookaddcap() # make knitr call a function at the end of each chunk# to try to automatically add to list of figuregetHdata(stressEcho)d <- stressEcho```::: {.callout-note .column-page collapse="true"}# `describe` Output```{r desc,results='asis'}# The callout was typed manually; could have run# makecnote(~ html(describe(d)), wide=TRUE)w <-describe(d)html(w)```:::```{r eval=FALSE}# To create a separate browser window:cat(html(w), file='desc.html')browseURL('desc.html', browser='firefox -new-window')```Better, whether using `RStudio` or not:```{r eval=FALSE}htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)# Use htmlViewx to use an external browser window (see above)```There is also a `plot` method for `describe` output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use `ggplot2` to produce static graphics. The result can be fed directly into `maketabs` described earlier. `results='asis'` must appear in the chunk header.```{r pdesc,results='asis'}cap <-'Regular `plot(describe)` output'maketabs(plot(w, bvspace=2.5), basecap=cap, cap=1)```By specifying `grType` option you can instead get `plotly` graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.```{r pldesc,results='asis'}options(grType='plotly')cap <-'`plotly` `plot(describe)` output'maketabs(plot(w, bvspace=2.5), wide=TRUE, cap=1, basecap=cap)```See [this](http://hbiostat.org/R/Hmisc/examples.html) for other `Hmisc` functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The `summaryM` function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).```{r summaryM,results='asis'}require(data.table)setDT(d) # turn d into a data table# tables() with no arguments will give a concise summary of all active data tablesw <- dw[, hxofMI :=factor(hxofMI, 0:1, c('No history of MI', 'History of MI'))]vars <-setdiff(names(d), 'hxofMI')form <-as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))print(form)s <-summaryM(form, data=d, test=TRUE)# Note: there is a problem with the width of the categorical variable# plot. Neither fig.size() nor options(plotlyauto=FALSE) fixed it.source('~/R/rscripts/reptools.r')maketabs(``~``, # empty tab`Table 1`~html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),`Categorical Variable Plot`~plot(s, which='categorical', vars=1:4) +caption('`summaryM` plots') +fig.size(width=6),`Continuous Variable Plot`~plot(s, which='continuous', vars=1:4),wide=TRUE)```Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.```{r spikeh}#| label: fig-spikeh#| fig-cap: Spike histograms of heart rate stratified by ECG category#| fig-height: 2.5d[, histboxp(x=maxhr, group=ecg, bins=200)]```## Longitudinal Continuous YFor a continuous response variable measured longitudinally, two of the ways to display the data are* "spaghetti plots" showing all data for all individual subjects, if the number of subjects is not very large* finding clusters of individual subject curve characteristics and plotting a random sample of raw data curves within each cluster if the sample size is largeThe `Hmisc` package [`curveRep`](https://www.rdocumentation.org/packages/Hmisc/versions/4.7-0/topics/curveRep) function facilitates the latter approach using representative curves. It considers per-subject sample sizes and measurement time gaps as these will not only limit how we look at the data but may be informative, e.g., subjects who are failing may start to get more frequent measurements.To demonstrate we simulate data in the following way.* Simulate 200 subjects (curves) with per-curve sample sizes ranging from 1 to 10* Make curves with odd-numbered IDs have a measurement time distribution that is random uniform [0,1] and those with even-numbered IDs have a time distribution that is half as wide but still centered at 0.5. Shift y values higher with increasing IDs* Make 1/3 of subjects have a flat trajectory, 1/3 linear and not flat, and 1/3 quadraticRequest `curveRep` to cluster the 200 curves on the following characteristics:* `kn=3` sample size groups (`curveRep` actually used one more than this)* `kxdist=2` time point distribution groups based here on the earliest and latest measurement times and the longest gap between any two measurements within a subject* `k=4` trajectory clusters. Trajectories are determined by `loess`-smoothing each subject's curve and linearly interpolating the estimates to the same evenly-spaced grid of time points, then clustering on these 10 y-estimates. This captures intercepts, slopes, and shapes.All subjects within each cluster are shown; we didn't need to take random samples for 200 subjects. Results for the 4 per-curve sample size groups are placed in separate tabs.```{r long,results='asis'}set.seed(1)N <-200nc <-sample(1:10, N, TRUE)id <-rep(1:N, nc)x <- y <- id# Solve for coefficients of quadratic function such that it agrees with # the linear function at x=0.25, 0.75 and is lower by -delta at x=0.5delta <--3cof <-list(c(0, 0, 0), c(0, 10, 0), c(delta, 10, -2* delta / (2*0.25^2)))for(i in1: N) { x[id==i] <-if(i %%2) runif(nc[i]) elserunif(nc[i], c(.25, .75)) shape <-sample(1:3, 1) xc <- x[id == i] -0.5 k <- cof[[shape]] y[id == i] <- i/20+ k[1] + k[2] * xc + k[3] * xc ^2+runif(nc[i], -2.5, 2.5)}require(cluster)w <-curveRep(x, y, id, kn=3, kxdist=2, k=4, p=10)gg <-vector('list', 4)nam <-rep('', 4)for(i in1:4) { z <-plot(w, i, method='data') # method='data' available in Hmisc 4.7-1 z <-transform(z,distribution =paste('Time Distribution:', distribution),cluster =paste('Trajectory Cluster:', cluster)) gg[[i]] <-if(i ==1)ggplot(z, aes(x, y, color=curve)) +geom_point() +facet_grid(distribution ~ cluster) +theme(legend.position='none')elseggplot(z, aes(x, y, color=curve)) +geom_line() +facet_grid(distribution ~ cluster) +theme(legend.position='none') nam[i] <- z$ninterval[1]}names(gg) <- nammaketabs(gg, basecap='Representative curves determined by `curveRep` stratified by per-subject sample size ranges', cap=1)```## Longitudinal Ordinal YContinuous longitudinal Y may be analyzed flexibly [using semiparametric models](https://hbiostat.org/proj/covid19) while being described using representative curves as just discussed. Now suppose that Y is discrete. There are three primary ways of describing such data graphically.1. Show event occurrences/trajectories for a random sample of subjects (`Hmisc::multEventChart`) [Thanks to Lucy D'Agostino McGowan for writing most of the code for `multEventChart`]{.aside}1. Show all transition proportions if time is discrete (`Hmisc::propsTrans`)1. Show all state occupancy proportions if time is discrete (`Hmisc::propsPO`)Starting with a multiple event chart, simulate data on 5 patients, then display all the raw data.```{r multevent}#| label: fig-descript-multevent#| fig-cap: "Multi-event chart for 4 patients"#| fig-height: 3.5pt1 <-data.frame(pt=1, day=0:3,status=.q(well, well, sick, 'very sick'))pt2 <-data.frame(pt=2, day=c(1,2,4,6),status=.q(sick, 'very sick', coma, death))pt3 <-data.frame(pt=3, day=1:5,status=.q(sick, 'very sick', sick, 'very sick', 'discharged'))pt4 <-data.frame(pt=4, day=c(1:4, 10),status=.q(well, sick, 'very sick', well, discharged))d <-rbind(pt1, pt2, pt3, pt4)d <-upData(d, status =factor(status, .q(discharged, well, sick,'very sick', coma, death)),labels =c(day ='Day'), print=FALSE)kabl(pt1, pt2, pt3, pt4)multEventChart(status ~ day + pt, data=d,absorb=.q(death, discharged),colorTitle='Status', sortbylast=TRUE) +theme_classic() +theme(legend.position='bottom')```For an example of displaying transition proportions, whereby all the outcome information in the raw data is shown, simulate some random data. The result is a `plotly` graphic over which you hover the pointer to see details. The size of the bubbles is proportional to the proportion in that transition.```{r propsTrans}#| label: fig-strans#| fig-cap: State transition proportionsset.seed(1)d <-expand.grid(id=1:30, time=1:7)setDT(d) # convert to data.tabled[, sex :=sample(c('female', 'male'), .N, replace=TRUE)]d[, state :=sample(LETTERS[1:4], .N, replace=TRUE)]ggplotlyr(propsTrans(state ~ time + id, data=d))```The final display doesn't capture the within-subject correlation as done with the transition proportions, but is the most familiar display for longitudinal ordinal data as it shows proportions in the current states, which are cumulative incidence estimates for absorbing states. [Should absorbing states have occurred in the data you would need to carry these forward to the end of follow-up for `propsPO` to work properly, even though the real data file would terminate follow-up at an absorbing event.]{.aside}```{r propsPO}#| label: fig-sops#| fig-cap: State occupancy proportions by time and male/female#| fig-height: 3ggplotlyr(propsPO(state ~ time + sex, data=d))```## Adverse Event ChartWhen there is a potentially large number of event types, such as adverse events (AEs) in a clinical trial, and the event timing is not considered, a dot chart is an excellent way to present the proportion of subjects suffering each type of AE. The AEs can be sorted in descending order by the difference in proportions between treatments, and `plotly` hover text can display more details. Half-width confidence intervals are used (see @sec-confbands). An AE chart is easily produced using the `aePlot` function in the `qreport` repository on [`github`](https://github.com/harrelfe/qreport/tree/master). `aePlot` expects the dataset to have one record per subject per AE, so the dataset itself does not define the proper denominator and this must be specified by the user (see `denom` below). The color coded needles in the right margin are guideposts to which denominators are being used in the analysis (details are [here](http://hbiostat.org/R/hreport/report.html)).```{r aeplot,results='asis'}#| fig-height: 7.5getRs('misc.r', grepo='qreport')getRs('aePlot.r', grepo='qreport')getHdata(aeTestData) # original data source: HH package# One record per subject per adverse event# For this example, the denominators for the two treatments in the# pop-up needles will be incorrect because the dataset did not have# subject IDs.ae <- aeTestData# qreport requires us to define official clinical trial counts and # name of treatment variabledenom <-c(enrolled =1000,randomized =400,a=212, b=188)setqreportOption(tx.var='treat', denom=denom)aePlot(event ~ treat, data=ae, minincidence=.05, size='wide')```## Continuous Event TimesFor time-to-event data with possibly multiple types of events, an event chart is a good way to show the raw outcome data for a sample of up to perhaps 40 subjects. The `Hmisc` package offers the `event.chart` function written by Jack Lee, Kenneth Hess, and Joel Dubin. Here is an example they provided. Patients are sorted by diagnosis date.```{r event.chart}#| label: fig-event.chart#| fig-cap: Event chartgetHdata(cdcaids)event.chart(cdcaids,subset.c=.q(infedate, diagdate, dethdate, censdate),x.lab ='Observation Dates',y.lab='Patients',titl='AIDS Data Calendar Event Chart',point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),legend.plot=TRUE, legend.location='i', legend.cex=0.8,legend.point.text=.q(transfusion,'AIDS diagnosis',death,censored),legend.point.at =list(c(7210, 8100), c(35, 27))) ```## Describing Variable InterrelationshipsThe most basic way to examine interrelationships among variables is to graphically depict a correlation matrix. Below is an example on `support` using the Spearman's $\rho$ rank correlation coefficient. Another good descriptive analysis to help understand relationships among variables and redundancies/collinearities is variable clustering. Here one clusters on variables instead of observations, and instead of a distance matrix we have a similarly matrix. A good default similarity matrix is based on the square of $\rho$. In order to restrict ourselves to _unsupervised learning_, also called _data reduction_, we restrict attention to non-outcome variables in both displays. The `vClus` function in `reptools` runs the dataset, after excluding some variables, through the `Hmisc``dataframeReduce` function to eliminate variables that are missing more than 0.4 of the time and to ignore character or factor variables having more than 10 levels. Binary variables having prevalence < 0.05 are dropped, and categorical variables having < 0.05 of their non-missing values in a category will have such low-frequency categories combined into an "other" category for purposes of computing all the correlation coefficients.[Normally one would omit the `fracmiss, maxlevels, minprev` arguments as the default values are reasonable.]{.aside}```{r varclus,results='asis'}getHdata(support)outcomes <-.q(slos, charges, totcst, totmcst, avtisst, d.time, death, hospdead, sfdm2)vClus(support, exclude=outcomes, corrmatrix=TRUE,fracmiss=0.4, maxlevels=10, minprev=0.05,label='fig-varclus')```The most strongly related variables are competing indicator variables for categories from the same variable, `scoma` vs. `dzgroup` "Coma", and `dzgroup` "Cancer" vs. `dzclass` "Cancer". The first type of relationship generates a strong negative correlation because if you're in one category you can't be in the other.```{r echo=FALSE}saveCap('09')```