5  Missing Data

It is extremely important to understand the extent and patterns of missing data, starting with charting the marginal fraction of observations with NAs for each variable. The occurrence of simultaneous missings on multiple variables makes multiple imputation and analysis more difficult, so it is important to correlate and quantify missingness in variables multiple ways. The Hmisc package naclus, naplot, and combplotp functions provide a number of graphics along these lines.

The missChk function in reptools uses these functions and others to produces a fairly comprehensive missingness report, placing each output in its own tab. When the number of variables containing any NA is small, the Hmisc na.pattern function’s simple output is by default all that is shown, and only one sentence is produced if there are no variables with NAs. Here is an example using again the the 1000-patient support dataset on hbiostat.org/data, retrieved with the Hmisc function getHdata. Variables with no missing values are excluded from the report (except for being used in the predictive model at the end) to save space. The chart in the next-to-last tab is interactive. We also use the prednmiss options to run an ordinal logistic regression model to predict the number of missing variables from the values of all the non-missing variables, omitting the predictor dzclass because it is redundant with the variable dzgroup. The results of this analysis are in the last tab.

getRs('reptools.r', put='source')  # Define dataChk, missChk, maketabs, ...
# Make it into a data table
# Remove one variable we'll not be using
support[, adlsc := NULL]
missChk(support, prednmiss=TRUE, omitpred = ~ dzclass)

15 variables have no NAs and 19 variables have NAs

Sequential frequency-ordered exclusions due to NAs
adlp urine alb pafi income adls bili totcst sfdm2 wblc
634 192 70 34 16 11 6 2 2 1
Logistic Regression Model
 rms::lrm(formula = as.formula(form), data = d)
Frequencies of Responses
   0   1   2   3   4   5   6   7   8   9  10  11  12  13 
  32 100  96 114 134 130 125  90  82  44  34   8   9   2 
Model Likelihood
Ratio Test
Rank Discrim.
Obs 1000 LR χ2 197.64 R2 0.181 C 0.660
max |∂log L/∂β| 2×10-8 d.f. 20 R220,1000 0.163 Dxy 0.319
Pr(>χ2) <0.0001 R220,988.6 0.164 γ 0.319
Brier 0.177 τa 0.287

The likelihood ratio \(\chi^2\) test in the last tab is a test of whether any of a subject’s non-missing variable values are associated with the number of missing variables on the subject. It shows strong evidence for such associations. From the dot chart we see that the strongest predictors of missing baseline variables are time to death/censoring and disease group. This may be due to patients on ventilators not being able to provide as much baseline information such as activities of daily living (adlp), and being on a ventilator is a strong prognostic sign. There is a possible sex effect worth investigating.