flowchart LR
PA[Principled Analysis]
PA --> plan[Pre-specification to Limit Double Dipping]
PA --> rep[Reproducibility]
PA --> raw[Respect Raw Data in Analysis]
PA --> nd[Never Categorize Continuous or Ordinal Variables]
PA --> sc[Choose Statistics and Uncertainty Intervals Respecting the Design]
An[Analysis] --> Formal[Formal<br>See hbiostat.org/bbr] & DA[Descriptive]

Orders of Descriptive Analysis

flowchart LR
a1[First Order<br><hr>Summarize Distribution of X]
a2["Second Order<br><hr>Assess Shape and<br>Strength of Association<br>Between X & Y"]
a3["Third Order<br><hr>Assess How<br>Association Between<br>X & Y Varies with Z"]
a1 --> a2 --> a3

Types of Descriptive Analysis

flowchart LR
HI[High Information Displays] --> AT[Avoid Tables<br>When X is Continuous]
HI --> NT[Nonparametric Smoothers]
HI --> Dist[Distributions Depicted With<br>Spike Histograms and<br>Extended Box Plots]
HI --> mov[General Approach:<br>Statistics in Moving<br>Overlapping Windows]
For[Formatting Analysis output]

15.1 Big Picture

For analysis the sky is the limit, but statistical principles should guide every step. Some of the general principles are

If there is to be a pivotal analysis there should be a statistical analysis plan (SAP) for this analysis that does not allow for many “statistician degrees of freedom.” The plan should be completed before doing any analysis that might inform analysis choices in a way that would bias the results (e.g., bias the estimate of treatment effect or bias standard errors of effects in a model).

All analyses should be completely reproducible. Explicitly state random number seeds if any random processes (bootstrap, simulation, Bayesian posterior sampling) are involved.

Exploratory analysis can take place after any needed SAP is completed.

Stay close to the raw data. Analyze the rawest form of the data when possible. Don’t convert inherently longitudinal data into time-to-first-event.

Continuous or ordinal variables should never be dichotomized even for purely descriptive exploratory analysis. For example, computing proportions of patients with disease stratified by quintiles of weight will be both inefficient and misleading.

Descriptive and inferential statistics should respect the study design. For parallel-group studies, it is not appropriate to compute change from baseline.

Question whether unadjusted estimates should be presented. If females in the study are older and age is a risk factor for the outcome, what is the meaning of female - male differences unadjusted for age?

For observational group comparisons, make sure that experts are consulted about which variables are needed to capture selection processes (e.g., confounding by indication) before data acquisition. If data are already collected and do not contain the variables that capture reasons for decisions such as treatment selection, you may do well to find a different project.

If the study is a parallel-group randomized clinical trial (RCT), presenting descriptive statistics stratified by treatment (“Table 1”) is not helpful, and it is more informative to describe the overall distribution of subjects. Even more helpful is to show how all baseline variables relate to the outcome variable.

An RCT is designed to estimate relative treatment effectiveness, and since it does not incorporate random sampling from the population, it cannot provide outcome estimates for a single treatment arm that reference the population. Hence uncertainty intervals for per-treatment outcomes are not meaningful, and uncertainty intervals should be presented only for treatment differences. This is facilitated by “half confidence intervals” described below.

Above the tendency to interchange the roles of independent and dependent variables by presenting a “Table 2” in such a way that stratifies by the outcome. Stratifying (conditioning) on the outcome is placing it in the role of a baseline variable. Instead, show relationships of baseline variables to outcomes as mentioned in the previous point.

Nonparametric smoothers and estimating in overlapping moving windows are excellent tools for relating individual continuous variables to an outcome.

Models are often the best descriptive tools because they can account for multiple variables simultaneously. For example, instead of computing proportions of missing values of a variable Y stratified by age groups and sex, use a binary logistic regression model to relate smooth nonlinear age and sex to the probability Y is missing.

15.2 Replacement for Table 1

Analyses should shed light on the unknown and not dwell on the known. In a randomized trial, the distributions of baseline variables are expected to be the same across treatments, and will be the same once \(N\) is large. When apparent imbalances are found, they lead to inappropriate decisions and ignore the fact that apparently counterbalancing factors are not hard to find. What is unknown and new is how the subject characteristics (and treatment) relate to the outcomes under study. While displaying this trend with a nonparametric smoother, one can simultaneously display the marginal distribution of the characteristic using an extended box plot, spike histogram, or rug plot.

A useful approach to replicating the same analysis for multiple variables is to “melt” the data table into a tall and thin one, with a single variable (here value) holding the original variable values, and another variable (here variable) holding the name of the variable whose values are currently contained in value. Thanks to ggplot2 having a wide variety of summarization functions built-in, the melted data table can be passed to ggplot2 and the variable easily used to create multiple panels (facets). Here is an example using the meltData and addggLayers functions from reptools. Extended box plots at the top show the mean (blue dot), median, and quantiles that cover 0.25, 0.5, 0.75, and 0.9 of the distribution. In addition to standard extended box plot quantiles, we show the 0.01 and 0.99 quantiles as dots. At the bottom is a spike histogram.

require(Hmisc)require(data.table)getRs('reptools.r')hookaddcap() # make knitr call a function at the end of each chunk# to try to automatically add to list of figure

getHdata(support)setDT(support)m <-meltData(hospdead ~ age + crea + meanbp + wblc, data=support)g <-ggplot(m, aes(x=value, y=hospdead)) +geom_smooth() +facet_wrap(~ variable, scales='free_x') +xlab('') +ylab('Probability of Death in Hospital') +ylim(0,1)g <-addggLayers(g, m, pos='top')addggLayers(g, m, type='spike')

Here is a prototype extended box plot to assist interpretation.

bpplt()

Here are more examples of extended box plots for showing distributions of continuous variables, stratified by disease group.

This is better done with interactive plots so that one can for example hover over a corner of a box plot and see which quantile that corner represents.

s <-summaryM(age + crea + meanbp + wblc ~ dzgroup,data=support)options(grType='plotly')plot(s)

15.3 Descriptively Relating One Variable to Another

To understand the relationship between a continuous variable X and an outcome or another variable Y we may estimate the mean, median, and other quantities as a smooth function of X. There are many ways to do this, including

For binary Y the mean is the proportion of ones, which estimates the probability that Y=1

making a scatter plot if Y is continuous or almost continuous

stratifying by fixed or variable intervals of X, e.g., summarizing Y by quintiles of X. This is arbitrary, inefficient, and misleading and should never be done.

using a nonparametric smoother such as loess

parametrically estimating the mean Y as a function of X using an ordinary linear least squares (OLS) model with a regression spline in X so as to not assume linearity

likewise but with a logistic regression model if Y is binary

semiparametrically estimating quantiles of Y as a function of X using quantile regression and a regression spline for X

semiparametrically estimating the mean, quantiles, and exceedance probabilities of Y as a function of X using an ordinal regression model and a spline in X

nonparametrically using overlapping moving windows of X that advance by a small amount each time. For each window compute the estimate of the property of Y using ordinary sample estimators (means, quantiles, Kaplan-Meier estimates, etc.). This approach has the fewest assumptions and is very general in the sense that all types of Y are accommodated. The moving estimates need to be smoothed; the R supsmu function is well suited for this.

The estimated trend curves depend on the window width and amount of smoothing, but this problem is tiny in comparison with the huge effect of changing how a continuous predictor is binned when the usual non-overlapping strata are created. The idea is assume smooth relationships and get close to the data.

In the following several of the above methods are illustrated to study how serum creatinine of critically ill patients relates to age. Start with a scatterplot that has no problems with ties in the data.

with(support, ggfreqScatter(age, crea))

Now consider moving estimates, least squares (OLS), ordinal regression (ORM), and quantile regression (QR) estimates, nonparametric loess estimates, and a flexible adaptive survival model. Moving estimates computed on overlapping x-variable windows, moving averages being the oldest example, have the advantage of great flexibility. As long as one has an estimator (mean, median, Kaplan-Meier estimate, etc.) that can be applied to a relatively homogeneous (with respect to x) sample, moving statistics can estimate smooth trends over x. Unless the windows are wide or the sample size is very large so that one can afford to use narrow x windows, the moving statistics will be noisy and need to be further smoothed. The smaller the windows, the larger the amount of smoothing will be needed. To control bias it is generally better to have smaller windows and more after-estimation smoothing.

The function movStats on Github provides two methods for creating moving overlapping windows from x. The default used here creates varying-width intervals in the data space but fixed-width in terms of sample size. It includes by default 15 observations to the left of the target point and 15 to the right, and moves up \(\max(\frac{n}{200}, 1)\) observations for each evaluation of the statistics. These may be overridden by specifying eps and xinc. If the user does not provide a statistical estimation function stat, the mean and all three quartiles are estimated for each window. movStats makes heavy use of the data.table, rms, and other packages. For ordinal regression estimates of the mean and quantiles the log-log link is used in the example below. Moving estimates are shown with and without supsmu-smoothing them.

# pr='margin' causes window information to be put in marginggplot(u, aes(x=age, y=crea, col=Type)) +geom_line() +facet_wrap(~ Statistic) +xlab('Age') +ylab('Serum Creatinine')

Recommended practice for relating a continuous variable to another continuous variable, especially for replacing parts of Table 1 or Table 2, is to use smoothed moving statistics or (1) a spline OLS model to estimate the mean and (2) a spline quantile regression model for estimating quantiles. Here is an example best practice that shows a preferred subset of the estimates from the last plot. melt=TRUE is omitted so we can draw a ribbon to depict the outer quartiles.

Let’s describe how white blood count relates to the probability of hospital death, using a binary logistic regression model and moving proportions. The cube root transformation in regression fits is used because of the extreme skewness of WBC. Use 6 knots at default locations on \(\sqrt[3]{\mathrm{WBC}}\). The \(\sqrt[3]{\mathrm{WBC}}\) transformation affects moving statistics only in that mean x-values for plotting are cubes of mean \(\sqrt[3]{\mathrm{WBC}}\) instead of means on the original WBC scale.

u <-movStats(hospdead ~ wblc, k=6, eps=20, bass=3,trans =function(x) x ^ (1/3),itrans =function(x) x ^3,loess=TRUE, lrm=TRUE, msmooth='both',melt=TRUE, pr='margin', data=support)

The flexibility of the moving statistic method is demonstrated by estimating how age relates to probabilities of death within 1y and within 2y using Kaplan-Meier estimates in overlapping moving windows. Assumptions other than smoothness (e.g., proportional hazards) are avoided in this approach. Here is an example that also uses an flexible parametric method, hazard regression, implemented in the R polspline package, that adaptively finds knots (points of slope change) in the covariate and in time, and products of piecewise linear terms so as to allow for non-proportional hazards. We use far less penalization than is the default for the hare function for demonstration purposes. For this dataset the default settings of penalty and maxdim result in straight lines.

movStats can also compute stratified non-smoothed estimates when x is discrete. After computing 1- and 2y Kaplan-Meier incidence probability estimates, order disease groups by ascending order of 1-year mortality before plotting.

It is possible to descriptively estimate trends against more than one independent variables when the effective sample size is sufficient. Trends can be estimated nonparametrically through stratification (when the third variable is categorical) or with flexible regression models allowing the two predictors to interact. In the graphical displays it is useful to keep sample size limitations in certain regions of the space defined by the two predictors in mind, by superimposing spike histograms on trend curves.

Repeat the last example but stratified by disease class. The window is widened a bit because of the reduced sample size upon stratification. Default smoothing is used for hazard regression.

# The Coma stratum has only n=60 so is not compatible with eps=75# Use varyeps optionsu <-movStats(Surv(d.time /365.25, death) ~ age + dzclass, times=1:2,eps=30,msmooth='both', bass=8, hare=TRUE,melt=TRUE, data=support, pr='margin')

Consider another example with a continuous dependent variable. Use the NHANES dataset that was created for analyzing glycohemoglobin (HbA\(_{\mathrm{1c}}\)) for diabetes screening. Stratify by race/ethnicity

getHdata(nhgh)u <-movStats(gh ~ age + re,melt=TRUE, data=nhgh, pr='margin')

Mimic these results using flexible regression with interaction. Start by estimating the mean. Add spike histograms to estimated trend curves. Spike heights are proportional to the sample size in age/race-ethnicity groups after binning age into 100 bins. Direct plotly plotting is used. The user can click on elements of the legend (including the histograms) to turn their display off and on.

require(rms)options(prType='html') # needed to use special formatting (can use prType='latex')dd <-datadist(nhgh); options(datadist='dd')f <-ols(gh ~rcs(age, 5) * re, data=nhgh)# fontsize will be available for print(anova()) in rms 6.3-1makecolmarg(anova(f), dec.ms=2, dec.ss=2, fontsize=0.6)

Analysis of Variance for gh

d.f.

Partial SS

MS

F

P

age (Factor+Higher Order Factors)

20

878.06

43.90

55.20

<0.0001

All Interactions

16

42.55

2.66

3.34

<0.0001

Nonlinear (Factor+Higher Order Factors)

15

61.26

4.08

5.13

<0.0001

re (Factor+Higher Order Factors)

20

169.42

8.47

10.65

<0.0001

All Interactions

16

42.55

2.66

3.34

<0.0001

age × re (Factor+Higher Order Factors)

16

42.55

2.66

3.34

<0.0001

Nonlinear

12

14.62

1.22

1.53

0.1051

Nonlinear Interaction : f(A,B) vs. AB

12

14.62

1.22

1.53

0.1051

TOTAL NONLINEAR

15

61.26

4.08

5.13

<0.0001

TOTAL NONLINEAR + INTERACTION

19

101.38

5.34

6.71

<0.0001

REGRESSION

24

937.86

39.08

49.13

<0.0001

ERROR

6770

5384.94

0.80

# Normal printing: anova(f) or anova(f, dec.ms=2, dec.ss=2)hso <-list(frac=function(f) 0.1* f /max(f),side=1, nint=100)# Plot with plotly directlyplotp(Predict(f, age, re), rdata=nhgh, histSpike.opts=hso)

Now use quantile regression to estimate quartiles of glycohemoglobin as a function of age and race/ethnicity.

f1 <-Rq(gh ~rcs(age, 5) * re, tau=0.25, data=nhgh)f2 <-Rq(gh ~rcs(age, 5) * re, tau=0.5, data=nhgh)f3 <-Rq(gh ~rcs(age, 5) * re, tau=0.75, data=nhgh)p <-rbind(Q1 =Predict(f1, age, re, conf.int=FALSE),Median =Predict(f2, age, re, conf.int=FALSE),Q3 =Predict(f3, age, re, conf.int=FALSE))ggplot(p, histSpike.opts=hso)

15.5 Another Replacement for Table 1

We can create a matrix of plots that respect continuous baseline variables while staying close to the data through the use of overlapping moving windows. In the following example we compute moving 1y and 2y mortality for selected continuous baseline variables in support and stack them together. Flexible HARE hazard regression estimates are also included.

reptools includes a function varType to determine the continuous/discrete nature of each variable, and other functions that make it easy to extract the list of either continuous variables (conVars) or discrete variables (disVars). varType also has a third classification: non-numeric variables that have too many (by default > 20) distinct values to be considered discrete.

$continuous
[1] age edu scoma meanbp wblc hrt resp temp pafi
[10] alb bili crea sod ph glucose bun urine adlsc
$discrete
[1] sex dzgroup dzclass num.co income race adlp adls

Let’s use only the first 9 continuous variables. In addition to showing all the estimated relationships with the outcome, put covariate distributions in collapsed note. Note the bimodality of some of the measurements, and true zero blood pressures for patients having cardiac arrest.

V <- types$continuous[1:9]U <-list()for(v in V) { x <- support[[v]] u <-movStats(Surv(d.time /365.25, death) ~ x, times=1:2,eps=30, hare=TRUE, penalty=0.25, maxdim=10,msmooth='smoothed', bass=8,melt=TRUE, data=support) U[[label(x, default=v)]] <- u # stuffs u in an element of list U# & names the element w/ var label/name}w <-rbindlist(U, idcol='vname') # stack all the data tablesggplot(w, aes(x, y=incidence, col=Statistic, linetype=Type)) +geom_line() +facet_wrap(~ vname, scales='free_x') +ylab(label(u$incidence)) +xlab('') +guides(color=guide_legend(title='')) +theme(legend.position='bottom',strip.text =element_text(size=8))makecnote(`Covariate Distributions`~plot(describe(support[, ..V])))

Covariate Distributions

If we were not showing main graphs in wide format (using a Quarto callout) we could have put the marginal distributions in the right margin using the following, which shrinks the plotly output.

require(plotly) # for %>%pl <-plot(describe(support[, ..V])) %>%layout(autosize=FALSE, width=350, height=325)makecolmarg(~ pl)

Likewise we can produce a graph summarizing how categorical baseline variables relate to the study outcome variable.

V <- types$discrete # or disVars(support, exclude=...)U <-list()for(v in V) { x <- support[[v]] u <-movStats(Surv(d.time /365.25, death) ~ x, times=1:2,discrete=TRUE, melt=TRUE, data=support) U[[label(x, default=v)]] <- u}w <-rbindlist(U, idcol='vname') # stack the tablesggplot(w, aes(x=incidence, y=x, col=Statistic)) +geom_point() +facet_wrap(~ vname, scales='free_y') +xlab(label(u$incidence)) +ylab('') +guides(color=guide_legend(title='')) +theme(legend.position='bottom')

Alternatively we can put each variable in a separate tab:

gg <-function(data)ggplot(data, aes(x=incidence, y=x, col=Statistic)) +geom_point() +xlab('Mortality') +ylab('') +guides(color=guide_legend(title='')) +theme(legend.position='bottom')g <-lapply(U, gg) # one ggplot per element (a data table) in Umaketabs(g, cap=1,basecap='Kaplan-Meier estimates of 1y and 2y incidence with each predictor in its own tab')

Studies almost never randomly sample from a population, hence inference to the population for a single treatment’s outcome should seldom be attempted. The uncertainty intervals and bands that should be presented are ones having inferential meaning and are based on treatment differences. One can easily construct a graph that shows differences and confidence intervals for them, but it is useful to be able to show the individual group estimates along with CIs for the differences. Fortunately, Maarten Boers had the idea of a null bar or null zone. When a confidence interval for a difference is symmetric, the confidence interval includes 0.0 if and only if the midpoint of the two outcome estimates \(\pm \frac{1}{4} \times w\) touches the individual group estimates, where \(w\) is the width of the confidence interval. Null zone/half-width CIs can be put to especially good use in avoiding clutter when displaying Kaplan-Meier plots, and can be graphed using the rms package survplot (static plot) and survplotp (plotly interactive graphic) functions. The latter has the additional advantage of providing continuous data on number of subjects still at risk by hovering over the survival curve for one group. Here is an example using support. Estimate survival differences between patients who were or were not able to be interviewed for determining their baseline activities of daily living.

The primary reason for not being interviewed was the patient needing to be on a ventilator.

Cumulative incidence are recommended over cumulative survival probabilities, principally because many journals will force you to scale the \(y\)-axis for survival probability as \([0,1]\) even in a very low-risk sample, whereas journals do not have silly scaling conventions for cumulative incidence.