6  Missing Data

flowchart LR
Ext[Extent of NAs] --> PV[Per Variable] & PO[Per Observation]
P[Patterns] --> Cl[Clustering of Missingness]
P --> Seq[Sequential Exclusions]
P --> Rel[Extent of Association<br>Between Values of<br>Non-missing Variables<br>and Number of Variables<br>Missing per Observation]

It is extremely important to understand the extent and patterns of missing data, starting with charting the marginal fraction of observations with NAs for each variable. The occurrence of simultaneous missings on multiple variables makes multiple imputation and analysis more difficult, so it is important to correlate and quantify missingness in variables multiple ways. The Hmisc package naclus, naplot, and combplotp functions provide a number of graphics along these lines.

The missChk function in reptools uses these functions and others to produces a fairly comprehensive missingness report, placing each output in its own tab. When the number of variables containing any NA is small, the Hmisc na.pattern function’s simple output is by default all that is shown, and only one sentence is produced if there are no variables with NAs. Here is an example using again the the 1000-patient support dataset on hbiostat.org/data, retrieved with the Hmisc function getHdata. Variables with no missing values are excluded from the report (except for being used in the predictive model at the end) to save space. The chart in the next-to-last tab is interactive. We also use the prednmiss options to run an ordinal logistic regression model to predict the number of missing variables from the values of all the non-missing variables, omitting the predictor dzclass because it is redundant with the variable dzgroup. The results of this analysis are in the last tab.

require(Hmisc)
require(data.table)
getRs('reptools.r')  # Define dataChk, missChk, maketabs, ...
getHdata(support)
# Make it into a data table
setDT(support)
# Remove one variable we'll not be using
support[, adlsc := NULL]
missChk(support, prednmiss=TRUE, omitpred = ~ dzclass)

15 variables have no NAs and 19 variables have NAs

support has 1000 observations (32 complete) and 34 variables (15 complete)

Number of NAs
Minimum Maximum Mean
Per variable 0 634 141.6
Per observation 0 13 4.8
Frequency distribution of number of NAs per variable
0 3 5 6 24 25 105 159 202 250 253 297 310 349 372 378 455 470 517 634
15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Frequency distribution of number of incomplete variables per observation
0 1 2 3 4 5 6 7 8 9 10 11 12 13
32 100 96 114 134 130 125 90 82 44 34 8 9 2

Figure 6.1: Missing data patterns in support

Sequential frequency-ordered exclusions due to NAs
adlp urine alb pafi income adls bili totcst sfdm2 wblc
634 192 70 34 16 11 6 2 2 1
Logistic Regression Model
 rms::lrm(formula = as.formula(form), data = d)
 
Frequencies of Responses
   0   1   2   3   4   5   6   7   8   9  10  11  12  13 
  32 100  96 114 134 130 125  90  82  44  34   8   9   2 
 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 1000 LR χ2 197.64 R2 0.181 C 0.660
max |∂log L/∂β| 2×10-8 d.f. 20 R220,1000 0.163 Dxy 0.319
Pr(>χ2) <0.0001 R220,988.6 0.164 γ 0.319
Brier 0.177 τa 0.287

The likelihood ratio \(\chi^2\) test in the last tab is a test of whether any of a subject’s non-missing variable values are associated with the number of missing variables on the subject. It shows strong evidence for such associations. From the dot chart we see that the strongest predictors of missing baseline variables are time to death/censoring and disease group. This may be due to patients on ventilators not being able to provide as much baseline information such as activities of daily living (adlp), and being on a ventilator is a strong prognostic sign. There is a possible sex effect worth investigating.