# Missing Data

## Code

```r
a <- aregImpute(~ age + sex + bp + death + heart.attack.before.death,
                data=mydata, n.impute=5)
f <- fit.mult.impute(death ~ rcs(age,3) + sex + rcs(bp,5),
                     lrm, a, data=mydata)
```

## Types of Missing Data

- Missing completely at random (MCAR)
- Missing at random (MAR)^{1}
- Informative missing (non-ignorable non-response)


^{1} “Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models” Harel & Zhou (2007).

See Carpenter & Smuk (2021), Schafer & Graham (2002), Donders et al. (2006), Harel & Zhou (2007), Allison (2001), White et al. (2011), and van Buuren (2012) for an introduction to missing data and imputation concepts.

## Prelude to Modeling

- Quantify extent of missing data
- Characterize types of subjects with missing data
- Find sets of variables that are missing on same subjects


## Missing Values for Different Types of Response Variables

- Serial data with subjects dropping out (not covered in this course)^{2}
- \(Y\)=time to event, follow-up curtailed: covered under survival analysis^{3}
- Often discard observations with completely missing \(Y\), but this is sometimes wasteful^{4}
- Characterize missings in \(Y\) before dropping obs.


^{2} Twisk et al. (2013) found instability in using multiple imputation of longitudinal data, and advantages of using instead full likelihood models.

^{3} White & Royston (2009) provide a method for multiply imputing missing covariate values using censored survival time data.

^{4} \(Y\) is so valuable that if one is only missing a \(Y\) value, imputation is not worthwhile, and imputation of \(Y\) is not advised if MCAR or MAR.

## Problems With Simple Alternatives to Imputation

Deletion of records—

- Badly biases parameter estimates when the probability of a case being incomplete is related to \(Y\) and not just \(X\) (Little & Rubin, 2002).
- Deletion because of a subset of \(X\) being missing always results in inefficient estimates
- Deletion of records with missing \(Y\) can result in biases (Crawford et al., 1995) but is the preferred approach under MCAR^{5}
- von Hippel (2007) found advantages to a “use all variables to impute all variables then drop observations with missing \(Y\)” approach (but see Sullivan et al. (2015))
- Lee & Carlin (2012) suggest that observations missing on both \(Y\) and on a predictor of major interest are not helpful
- Only discard obs. when
    - MCAR can be justified
    - Rarely missing predictor of overriding importance that can’t be imputed from other data
    - Fraction of obs. with missings small and \(n\) is large

- No advantage of deletion except savings of analyst time
- Making up missing data better than throwing away real data
- See Knol et al. (2010)


^{5} Multiple imputation of \(Y\) in that case does not improve the analysis and assumes the imputation model is correct.

Adding extra categories of categorical predictors—

- Including missing data but adding a category `missing` causes serious biases (Allison, 2001; Jones, 1996; Vach & Blettner, 1998)
- Problem acute when values missing because subject too sick
- Difficult to interpret
- Fails even under MCAR (Allison, 2001; Donders et al., 2006; Jones, 1996; Knol et al., 2010; van der Heijden et al., 2006)
- May be OK if values are “missing” because of “not applicable”^{6}


^{6} E.g. you have a measure of marital happiness, dichotomized as high or low, but your sample contains some unmarried people. OK to have a 3-category variable with values high, low, and unmarried—Paul Allison, IMPUTE list, 4Jul09.

Likewise, serious problems are caused by setting missing continuous predictors to a constant (e.g., zero) and adding an indicator variable to try to estimate the effect of missing values.

Two examples from Donders et al. (2006) using binary logistic regression, \(N=500\).

Results of 1000 simulations with \(\beta_{1}=1.0\) under MAR and two types of imputation:

| Imputation Method | \(\hat{\beta}_{1}\) | S.E. | Coverage of 0.90 C.I. |
|---|---|---|---|
| Single | 0.989 | 0.09 | 0.64 |
| Multiple | 0.989 | 0.14 | 0.90 |

Now consider a simulation with \(\beta_{1}=1\), \(\beta_{2}=0\), and \(X_{2}\) correlated with \(X_{1}\) (\(r=0.75\)) but redundant in predicting \(Y\); a missingness indicator is used when \(X_{1}\) is MCAR in 0.4 of 500 subjects. This is also compared with grand-mean fill-in imputation.

Results of 1000 simulations adding a third predictor indicating missing for \(X_{1}\):

| Imputation Method | \(\hat{\beta}_{1}\) | \(\hat{\beta}_{2}\) |
|---|---|---|
| Indicator | 0.55 | 0.51 |
| Overall mean | 0.55 | |

In the incomplete observations the constant \(X_{1}\) is uncorrelated with \(X_{2}\).

**The goal of imputation is to preserve the information and meaning of the non-missing data.**

There is a full Bayesian modeling alternative to all the methods presented below. The Bayesian approach requires more effort but has several advantages (Erler et al., 2016).

Exactly how are missing values estimated?

- Could ignore all other information — random or grand mean fill-in
- Can use external info not used in response model (e.g., zip code for income)
- Need to utilize reason for non-response if possible
- Use statistical model with sometimes-missing \(X\) as response variable
- Model to estimate the missing values should include all variables that
    - are related to the missing data mechanism;
    - have distributions that differ between subjects that have the target variable missing and those that have it measured;
    - are associated with the sometimes-missing variable when it is not missing; or
    - are included in the final response model (Barzi & Woodward, 2004; Harel & Zhou, 2007)

- Ignoring imputation results in biased \(\hat{V}(\hat{\beta})\)
- `transcan` function in the Hmisc package: “optimal” transformations of all variables to make residuals more stable and to allow non-monotonic transformations
- `aregImpute` function in Hmisc: good approximation to full Bayesian multiple imputation procedure using the bootstrap
- `transcan` and `aregImpute` use the following for fitting imputation models:


- initialize `NA`s to median (mode for categoricals)
- expand all categorical predictors using dummy variables
- expand all continuous predictors using restricted cubic splines
- optionally optimally transform the variable being predicted by expanding it with restricted cubic splines and using the first canonical variate (multivariate regression) as the optimum transformation (maximizing \(R^2\))
- one-dimensional scoring of categorical variables being predicted using canonical variates on dummy variables representing the categories (Fisher’s optimum scoring algorithm); when imputing categories, solve for which category yields a score that is closest to the predicted score

- `aregImpute` and `transcan` work with `fit.mult.impute` to make final analysis of the response variable relatively easy
- Predictive mean matching (Little & Rubin, 2002): replace missing value with the observed value of the subject having the closest predicted value to the predicted value of the subject with the `NA`. Key considerations are how to


- model the target when it is not `NA`
- match donors on predicted values
- avoid overuse of “good” donors to disallow excessive ties in imputed data
- account for all uncertainties

- Predictive model for each target uses any outcomes, all predictors in the final model other than the target, plus auxiliary variables not in the outcome model
- No distributional assumptions; nicely handles target variables with strange distributions (Vink et al., 2014)
- Predicted values need only be monotonically related to real predictive values
- PMM can result in some donor observations being used repeatedly
- Causes lumpy distribution of imputed values
- Address by sampling from multinomial distribution, probabilities = scaled distance of all predicted values to predicted value (\(y^{*}\)) of observation needing imputing
- Tukey’s tricube function is a good weighting function (used in loess): \(w_{i} = (1 - \min(d_{i}/s, 1)^{3})^{3}\), where \(d_{i} = |\hat{y}_{i} - y^{*}|\) and \(s = 0.2\times\text{mean}|\hat{y}_{i} - y^{*}|\) is a good default scale factor; scale the weights so that \(\sum w_{i} = 1\)

- Recursive partitioning with surrogate splits — handles case where a predictor of a variable needing imputation is missing itself. But there are problems (Penning et al., 2018) even with completely random missingness.
- White et al. (2011) discusses an alternative method based on choosing a donor observation at random from the \(q\) closest matches (\(q=3\), for example)


- When interactions are in the outcome model, oddly enough it may be better to treat interaction terms as “just another variable” and do unconstrained imputation of them (Kim et al., 2015)
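The tricube-weighted donor selection described above can be sketched as follows. This is an illustrative Python rendering of the idea, not the Hmisc implementation, and the function names are hypothetical:

```python
import random

def tricube_weights(preds, y_star, scale_factor=0.2):
    """Tricube donor weights for predictive mean matching.

    preds:  predicted values for candidate donors (target observed)
    y_star: predicted value for the observation needing imputation
    """
    d = [abs(p - y_star) for p in preds]            # distances d_i = |yhat_i - y*|
    s = scale_factor * sum(d) / len(d)              # default scale s = 0.2 * mean(d_i)
    if s == 0:                                      # all donors predict exactly y*
        return [1.0 / len(d)] * len(d)
    w = [(1.0 - min(di / s, 1.0) ** 3) ** 3 for di in d]  # Tukey's tricube
    total = sum(w)
    if total == 0:                                  # every donor at least s away
        return [1.0 / len(w)] * len(w)
    return [wi / total for wi in w]                 # scale so sum(w_i) = 1

def pmm_draw(donor_values, donor_preds, y_star, rng=random):
    """Impute by drawing one donor's observed value, weighted by tricube distance."""
    w = tricube_weights(donor_preds, y_star)
    return rng.choices(donor_values, weights=w, k=1)[0]
```

Sampling donors rather than always taking the single nearest one is what avoids overuse of “good” donors and the resulting lumpy distribution of imputed values.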


- Can fill-in using unconditional mean or median if number of missings low and \(X\) is unrelated to other \(X\)s
- Otherwise, first approximation to good imputation uses other \(X\)s to predict a missing \(X\)
- This is a single “best guess” conditional mean
- \(\hat{X}_{j} = Z\hat{\theta}\), where \(Z = X_{\bar{j}}\) plus possibly auxiliary variables that precede \(X_{j}\) in the causal chain and are not intended to be in the outcome model. Cannot include \(Y\) in \(Z\) without adding random errors to imputed values as done with multiple imputation (would steal info from \(Y\))
- Recursive partitioning can sometimes be helpful for nonparametrically estimating conditional means


- Single imputation could use a random draw from the conditional distribution for an individual: \(\hat{X}_{j} = Z\hat{\theta} + \hat{\epsilon}\), where \(Z = [X_{\bar{j}}, Y]\) plus auxiliary variables, and \(\hat{\epsilon}\) is a draw from \(n(0, \hat{\sigma})\) or a random draw from the calculated residuals, using e.g.
    - the bootstrap, or
    - the approximate Bayesian bootstrap (Harel & Zhou, 2007; Rubin & Schenker, 1991): sample with replacement from a sample with replacement of the residuals

- Multiple imputations (\(M\)) with random draws
    - Draw a sample of \(M\) residuals for each missing value to be imputed
    - Average the \(M\) \(\hat{\beta}\)s
    - In general can provide least biased estimates of \(\beta\)
    - Simple formula for imputation-corrected \(\text{var}(\hat{\beta})\): a function of the average “apparent” variances and the between-imputation variances of \(\hat{\beta}\)
    - Even when the \(\chi^2\) distribution is a good approximation when data have no missing values, the \(t\) or \(F\) distributions are needed to have accurate \(P\)-values and confidence limits when there are missings (Lipsitz et al., 2002; Reiter, 2007)
    - **BUT**: full multiple imputation needs to account for uncertainty in the imputation models by refitting these models for each of the \(M\) draws; `transcan` does not do that, `aregImpute` does
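The pooling step above (average the \(M\) \(\hat{\beta}\)s; combine apparent and between-imputation variances) is Rubin's rule. A minimal sketch for a single coefficient, including Rubin's small-sample degrees of freedom for \(t\)-based inference (illustrative code, not tied to any package):

```python
def rubin_pool(betas, variances):
    """Pool one coefficient across M imputations using Rubin's rules.

    betas:     the M point estimates of the coefficient
    variances: the M squared standard errors (apparent variances)
    """
    M = len(betas)
    beta_bar = sum(betas) / M                      # pooled point estimate
    W = sum(variances) / M                         # average within-imputation variance
    B = sum((b - beta_bar) ** 2 for b in betas) / (M - 1)  # between-imputation variance
    T = W + (1 + 1 / M) * B                        # total variance of pooled estimate
    r = (1 + 1 / M) * B / W                        # relative increase in variance
    df = (M - 1) * (1 + 1 / r) ** 2 if r > 0 else float("inf")  # Rubin's d.f.
    return beta_bar, T, df
```

Confidence limits then use \(\hat{\beta} \pm t_{\mathrm{df}}\sqrt{T}\) rather than Gaussian limits, matching the point above about needing the \(t\) distribution.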

- Note that multiple imputation can and should use the response variable for imputing predictors (Moons et al., 2006)

`aregImpute` algorithm (Moons et al., 2006):

- Takes all aspects of uncertainty into account using the bootstrap
- Different bootstrap resamples used for each imputation by fitting a flexible additive model on a sample with replacement from the original data
- This model is used to predict all of the original missing and non-missing values for the target variable for the current imputation
- Uses flexible parametric additive regression models to impute
- There is an option to allow target variables to be optimally transformed, even non-monotonically (but this can overfit)
- By default uses predictive mean matching for imputation; no residuals required (can also do more parametric regression imputation)
- By default uses weighted PMM; many other matching options
- Uses by default van Buuren’s “Type 1” matching to capture the right amount of uncertainty by computing predicted values for missing values using a regression fit on the bootstrap sample, and finding donor observations by matching those predictions to predictions from potential donors using the regression fit from the original sample of complete observations
- When a predictor of the target variable is missing, it is first imputed from its last imputation when it was a target variable
- First 3 iterations of process are ignored (“burn-in”)
- Compares favorably to the `R` `MICE` approach
- Example: the code at the start of this chapter runs `aregImpute` followed by `fit.mult.impute`


See Barzi & Woodward (2004) for a nice review of multiple imputation with a detailed comparison of results (point estimates and confidence limits for the effect of the sometimes-missing predictor) for various imputation methods. Barnes et al. (2006) have a good overview of imputation methods and a comparison of bias and confidence interval coverage for the methods when applied to longitudinal data with a small number of subjects. Horton & Kleinman (2007) have a good review of several software packages for dealing with missing data, and a comparison of them with `aregImpute`. Harel & Zhou (2007) provide a nice overview of multiple imputation and discuss some of the available software. White & Carlin (2010) studied bias of multiple imputation vs. complete-case analysis. White et al. (2011) provide much practical guidance.

**Caution**: Methods can generate imputations having very reasonable distributions but still not having the property that final response model regression coefficients have nominal confidence interval coverage. It is worth checking that imputations generate the correct collinearities among covariates.

- With `MICE` and `aregImpute` we are using the chained equations approach (White et al., 2011)
- The chained equations approach handles a wide variety of target variables to be imputed and allows for multiple variables to be missing on the same subject
- Iterative process cycles through all target variables to impute all missing values (S. van Buuren et al., 2006)
- Does not attempt to use the full Bayesian multivariate model for all target variables, making it more flexible and easy to use
- Possible to create improper imputations, e.g., imputing conflicting values for different target variables
- However, simulation studies (S. van Buuren et al., 2006) demonstrate very good performance of imputation based on chained equations
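The iterative cycle can be sketched in a toy two-variable form. Assumptions: missing cells are refilled with conditional means from simple least squares; a real chained-equations implementation such as `MICE` or `aregImpute` adds proper random draws to reflect uncertainty and cycles over many variables and variable types:

```python
def _ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def chained_impute(x, y, n_iter=10):
    """Toy chained-equations cycle for two numeric lists, None = missing.

    Each pass regresses one variable on the other using the currently
    completed data, then refills that variable's missing cells with
    conditional means. Illustration only.
    """
    x, y = list(x), list(y)
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    # crude initialization: unconditional means of the observed values
    for col, miss in ((x, miss_x), (y, miss_y)):
        obs = [v for v in col if v is not None]
        for i in miss:
            col[i] = sum(obs) / len(obs)
    for _ in range(n_iter):
        a, b = _ols(x, y)                  # y ~ x on completed data
        for i in miss_y:
            y[i] = a + b * x[i]
        a, b = _ols(y, x)                  # x ~ y on completed data
        for i in miss_x:
            x[i] = a + b * y[i]
    return x, y
```

With exactly linear data the refilled values converge geometrically to the values consistent with the fitted relationship, which is the stationary point the chained cycle seeks.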


- MCAR can be partially assessed by comparing distribution of non-missing \(Y\) for those subjects with complete \(X\) vs. those subjects having incomplete \(X\) (Little & Rubin, 2002)
- Yucel and Zaslavsky (Yucel & Zaslavsky, 2008; see also He & Zaslavsky, 2012) proposed a diagnostic for the reasonableness of imputed values for a sometimes-missing predictor \(X_{j}\):
    - Duplicate the entire dataset
    - In the duplicated observations set all non-missing values of \(X_{j}\) to missing; let \(w\) denote this set of observations set to missing
    - Develop imputed values for the missing values of \(X_{j}\)
    - In the observations in \(w\) compare the distribution of imputed \(X_{j}\) to the original values of \(X_{j}\)
- Bondarenko & Raghunathan (2016) present a variety of useful diagnostics on the reasonableness of imputed values.
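The duplication idea can be sketched generically for a single column; `impute_fn` is a hypothetical stand-in for whatever imputation routine is being diagnosed, and a real application duplicates the whole dataset so the imputation model can use the other covariates:

```python
def duplication_diagnostic(xj, impute_fn):
    """Yucel-Zaslavsky-style check of imputed values for one variable.

    xj:        column with None marking missing values
    impute_fn: any imputation routine taking a column with Nones and
               returning a completed column (hypothetical interface)

    Returns (observed, imputed) values for the same positions, whose
    distributions can then be compared.
    """
    n = len(xj)
    observed_idx = [i for i, v in enumerate(xj) if v is not None]
    # duplicate the column; in the copy every value is set to missing, so the
    # originally observed cells get imputed there
    duplicated = list(xj) + [None] * n
    completed = impute_fn(duplicated)
    observed = [xj[i] for i in observed_idx]
    imputed = [completed[n + i] for i in observed_idx]
    return observed, imputed
```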


| Method | Deletion | Single | Multiple |
|---|---|---|---|
| Allows non-random missing | | x | x |
| Reduces sample size | x | | |
| Apparent S.E. of \(\hat{\beta}\) too low | | x | |
| Increases real S.E. of \(\hat{\beta}\) | x | | |
| \(\hat{\beta}\) biased | if not MCAR | x | |

The following contains crude guidelines. Simulation studies are needed to refine the recommendations. Here \(f\) refers to the proportion of observations having *any* variables missing.

**\(f < 0.03\):** It doesn’t matter very much how you impute missings or whether you adjust variance of regression coefficient estimates for having imputed data in this case. For continuous variables, imputing missings with the median non-missing value is adequate; for categorical predictors the most frequent category can be used. Complete case analysis is also an option here. Multiple imputation may be needed to check that the simple approach “worked.”

**\(f \geq 0.03\):** Use multiple imputation with the number of imputations^{7} equal to \(\max(5, 100f)\). Fewer imputations may be possible with very large sample sizes. See statisticalhorizons.com/how-many-imputations. Type 1 predictive mean matching is usually preferred, with weighted selection of donors. Account for imputation in estimating the covariance matrix for final parameter estimates. Use the \(t\) distribution instead of the Gaussian distribution for tests and confidence intervals, if possible, using the estimated d.f. for the parameter estimates.

**Multiple predictors frequently missing:** More imputations may be required. Perform a “sensitivity to order” analysis by creating multiple imputations using different orderings of sometimes-missing variables. It may be beneficial to initially sort variables so that the one with the most `NA`s will be imputed first.
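The \(\max(5, 100f)\) guideline is easy to compute; rounding up to a whole number of imputations is an assumption here, as the text gives only the formula:

```python
import math

def n_imputations(f):
    """Crude guideline from the text: M = max(5, 100f), where f is the
    proportion of observations with any missing values. Rounds up (an
    assumption, since M must be a whole number)."""
    return max(5, math.ceil(100 * f))
```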


^{7} White et al. (2011) recommend choosing \(M\) so that the key inferential statistics are very reproducible should the imputation analysis be repeated. They suggest the use of \(100f\) imputations. See also van Buuren (2012, sec. 2.7). von Hippel (2016) finds that the number of imputations should increase quadratically with the fraction of missing information.

Reason for missings more important than number of missing values.

Extreme amount of missing data does not prevent one from using multiple imputation, because alternatives are worse (Janssen et al., 2010; Madley-Dowd et al., 2019).

It is useful to look at examples of effective sample sizes in the presence of missing data. If a sample of 1000 subjects contains various amounts and patterns of missings, what size \(n_c\) of a complete sample would have equivalent information for the intended purpose of the analysis?

- A new marker was collected on a random sample of 200 of the subjects and one wants to estimate the added predictive value due to the marker: \(n_{c}=200\)
- Height is missing on 100 subjects but we want to study association between BMI and outcome. Weight, sex, and waist circumference are available on all subjects: \(n_{c}=980\)
- Each of 10 predictors is randomly missing on \(\frac{1}{10}\) of subjects, and the predictors are uncorrelated with each other and are each weakly related to the outcome: \(n_{c}=500\)
- Same as previous but the predictors can somewhat be predicted from non-missing predictors: \(n_{c}=750\)
- The outcome variable was not assessed on a random \(\frac{1}{5}\) of subjects: \(n_{c}=800\)
- The outcome represents sensitive information, is missing on \(\frac{1}{2}\) of subjects, and we don’t know what made subjects respond to the question: \(n_{c}=0\) (serious selection bias)
- One of the baseline variables was collected prospectively \(\frac{1}{2}\) of the time and for the other subjects it was retrospectively estimated only for subjects ultimately suffering a stroke and we don’t know which subjects had a stroke: \(n_{c}=0\) (study not worth doing)
- The outcome variable was assessed by emailing the 1000 subjects, for which 800 responded, and we don’t know what made subjects respond: \(n_{c}=0\) (model will possibly be very biased—at least the intercept)


- Multiple imputation developed as an approximation to a full Bayesian model
- Full Bayesian model treats missings as unknown parameters and provides exact inference and correct measures of uncertainty
- See this case study for an example
- The case study also shows how to do “posterior stacking” if you want to avoid having to specify a full model for missings, and instead use usual multiple imputations as described in this chapter
- Run a multiple imputation algorithm
- For each completed dataset run the Bayesian analysis and draw thousands of samples from the posterior distribution of the parameters
- Pool all these posterior draws over all the multiple imputations and do posterior inference as usual with no special correction required
- Made easy by the `Hmisc` package `aregImpute` function and the `rms` package `stackMI` function, as demonstrated in the Titanic case study later in the notes
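Posterior stacking as listed above reduces to concatenating the posterior draws from the \(M\) completed-data analyses. A minimal sketch for one parameter (names hypothetical; this is not the `rms` `stackMI` implementation):

```python
def stack_posteriors(draws_per_imputation):
    """Pool posterior draws of one parameter across M multiple imputations.

    draws_per_imputation: list of M lists of posterior draws, one list per
    completed dataset. Inference proceeds on the stacked sample with no
    special correction required.
    """
    stacked = sorted(d for draws in draws_per_imputation for d in draws)
    n = len(stacked)
    median = stacked[n // 2]                                  # crude posterior median
    interval = (stacked[int(0.025 * n)], stacked[int(0.975 * n)])  # crude 0.95 interval
    return stacked, median, interval
```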


Allison, P. D. (2001). *Missing Data*. Sage.

Barnes, S. A., Lindborg, S. R., & Seaman, J. W. (2006). Multiple imputation techniques in small sample clinical trials. *Stat Med*, *25*, 233–245.

bad performance of LOCF including high bias and poor confidence interval coverage;simulation setup;longitudinal data;serial data;RCT;dropout;assumed missing at random (MAR);approximate Bayesian bootstrap;Bayesian least squares;missing data;nice background summary;new completion score method based on fitting a Poisson model for the number of completed clinic visits and using donors and approximate Bayesian bootstrap

Barzi, F., & Woodward, M. (2004). Imputations of missing values in practice: Results from imputations of serum cholesterol in 28 cohort studies. *Am J Epi*, *160*, 34–45.

excellent review article for multiple imputation;list of variables to include in imputation model;"Imputation models should ideally include all covariates that are related to the missing data mechanism, have distributions that differ between the respondents and nonrespondents, are associated with cholesterol, and will be included in the analyses of the final complete data sets";detailed comparison of results (cholesterol effect and confidence limits) for various imputation methods

Bondarenko, I., & Raghunathan, T. (2016). Graphical and numerical diagnostic tools to assess suitability of multiple imputations and imputation models. *Stat Med*, *35*(17), 3007–3020. https://doi.org/10.1002/sim.6926

van Buuren, S. (2012). *Flexible imputation of missing data*. Chapman & Hall/CRC. https://doi.org/10.1201/b11826

Carpenter, J. R., & Smuk, M. (2021). Missing data: A statistical framework for practice. *Biometrical Journal*, *63*(5), 915–947. https://doi.org/10.1002/bimj.202000196

Crawford, S. L., Tennstedt, S. L., & McKinlay, J. B. (1995). A comparison of analytic methods for non-random missingness of outcome data. *J Clin Epi*, *48*, 209–219.

Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A gentle introduction to imputation of missing values. *J Clin Epi*, *59*, 1087–1091.

simple demonstration of failure of the add new category method (indicator variable)

Erler, N. S., Rizopoulos, D., Rosmalen, J., Jaddoe, V. W. V., Franco, O. H., & Lesaffre, E. M. E. H. (2016). Dealing with missing covariates in epidemiologic studies: A comparison between multiple imputation and a full Bayesian approach. *Stat Med*, *35*(17), 2955–2974. https://doi.org/10.1002/sim.6944

Harel, O., & Zhou, X.-H. (2007). Multiple imputation: Review of theory, implementation and software. *Stat Med*, *26*, 3057–3077.

failed to review aregImpute;excellent overview;ugly S code;nice description of different statistical tests including combining likelihood ratio tests (which appears to be complex, requiring an out-of-sample log likelihood computation);congeniality of imputation and analysis models;Bayesian approximation or approximate Bayesian bootstrap overview;"Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models ... it would be preferred if the missing data modelling was done by the data constructors and not by the users... MI yields valid inferences not only in congenial settings, but also in certain uncongenial ones as well—where the imputer’s model (1) is more general (i.e. makes fewer assumptions) than the complete-data estimation method, or when the imputer’s model makes additional assumptions that are well-founded."

He, Y., & Zaslavsky, A. M. (2012). Diagnosing imputation models by applying target analyses to posterior replicates of completed data. *Stat Med*, *31*(1), 1–18. https://doi.org/10.1002/sim.4413

Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. *Am Statistician*, *61*(1), 79–90.

Janssen, K. J., Donders, A. R., Harrell, F. E., Vergouwe, Y., Chen, Q., Grobbee, D. E., & Moons, K. G. (2010). Missing covariate data in medical research: To impute is better than to ignore. *J Clin Epi*, *63*, 721–727.

Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. *J Am Stat Assoc*, *91*, 222–230.

Kim, S., Sugar, C. A., & Belin, T. R. (2015). Evaluating model-based imputation methods for missing covariates in regression models with interactions. *Stat Med*, *34*(11), 1876–1888. https://doi.org/10.1002/sim.6435

Knol, M. J., Janssen, K. J. M., Donders, R. T., Egberts, A. C. G., Heerding, E. R., Grobbee, D. E., Moons, K. G. M., & Geerlings, M. I. (2010). Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: An empirical example. *J Clin Epi*, *63*, 728–736.

Lee, K. J., & Carlin, J. B. (2012). Recovery of information from multiple imputation: A simulation study. *Emerg Themes Epi*, *9*(1), 3+. https://doi.org/10.1186/1742-7622-9-3

Not sure that the authors satisfactorily dealt with nonlinear predictor effects. In the absence of strong auxiliary information, there is little to gain from multiple imputation with missing data in the exposure-of-interest. In fact, the authors went further to say that multiple imputation can introduce bias not present in a complete case analysis if a poorly fitting imputation model is used [from Yong Hao Pua]

Lipsitz, S., Parzen, M., & Zhao, L. P. (2002). A Degrees-Of-Freedom approximation in Multiple imputation. *J Stat Comp Sim*, *72*(4), 309–318. https://doi.org/10.1080/00949650212848

Little, R. J. A., & Rubin, D. B. (2002). *Statistical Analysis with Missing Data* (second). Wiley.

Madley-Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data should not be used to guide decisions on multiple imputation. *Journal of Clinical Epidemiology*, *110*, 63–73. https://doi.org/10.1016/j.jclinepi.2019.02.016

Moons, K. G. M., Donders, R. A. R. T., Stijnen, T., & Harrell, F. E. (2006). Using the outcome for imputation of missing predictor values was preferred. *J Clin Epi*, *59*, 1092–1101. https://doi.org/10.1016/j.jclinepi.2006.01.009

use of outcome variable; excellent graphical summaries of simulations

Penning de Vries, B. B. L., van Smeden, M., & Groenwold, R. H. H. (2018). Propensity Score Estimation Using Classification and Regression Trees in the Presence of Missing Covariate Data. *Epidemiologic Methods*, *7*(1). https://doi.org/10.1515/em-2017-0020

Reiter, J. P. (2007). Small-sample degrees of freedom for multi-component significance tests with multiple imputation for missing data. *Biometrika*, *94*(2), 502–508. https://doi.org/10.1093/biomet/asm028

Rubin, D., & Schenker, N. (1991). Multiple imputation in health-care data bases: An overview and some applications. *Stat Med*, *10*, 585–598.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. *Psych Meth*, *7*, 147–177.

excellent review and overview of missing data and imputation;problems with MICE;less technical description of 3 types of missing data

Sullivan, T. R., Salter, A. B., Ryan, P., & Lee, K. J. (2015). Bias and Precision of the “Multiple Imputation, Then Deletion” Method for Dealing With Missing Outcome Data. *American Journal of Epidemiology*, *182*(6), 528–534. https://doi.org/10.1093/aje/kwv100

Disagrees with von Hippel approach of "impute then delete" for Y

Twisk, J., de Boer, M., de Vente, W., & Heymans, M. (2013). Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. *J Clin Epi*, *66*(9), 1022–1028. https://doi.org/10.1016/j.jclinepi.2013.03.017

Vach, W., & Blettner, M. (1998). Missing Data in Epidemiologic Studies. In *Ency of Biostatistics* (pp. 2641–2654). Wiley.

van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully conditional specification in multivariate imputation. *J Stat Computation Sim*, *76*(12), 1049–1064.

justification for chained equations alternative to full multivariate modeling

van der Heijden, G. J. M. G., Donders, A. R. T., Stijnen, T., & Moons, K. G. M. (2006). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. *J Clin Epi*, *59*, 1102–1109. https://doi.org/10.1016/j.jclinepi.2006.01.015

Invalidity of adding a new category or an indicator variable for missing values even with MCAR

Vink, G., Frank, L. E., Pannekoek, J., & van Buuren, S. (2014). Predictive mean matching imputation of semicontinuous variables. *Statistica Neerlandica*, *68*(1), 61–90. https://doi.org/10.1111/stan.12023

von Hippel, P. T. (2016). *The number of imputations should increase quadratically with the fraction of missing information*. http://arxiv.org/abs/1608.05406

von Hippel, P. T. (2007). Regression with missing Ys: An improved strategy for analyzing multiple imputed data. *Soc Meth*, *37*(1), 83–117.

White, I. R., & Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. *Stat Med*, *29*, 2920–2931.

White, I. R., & Royston, P. (2009). Imputing missing covariate values for the Cox model. *Stat Med*, *28*, 1982–1998.

approach to using event time and censoring indicator as predictors in the imputation model for missing baseline covariates;recommended an approximation using the event indicator and the cumulative hazard transformation of time, without their interaction

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. *Stat Med*, *30*(4), 377–399.

practical guidance for the use of multiple imputation using chained equations;MICE;imputation models for different types of target variables;PMM choosing at random from among a few closest matches;choosing number of multiple imputations by a reproducibility argument, suggesting 100f imputations when f is the fraction of cases that are incomplete

Yucel, R. M., & Zaslavsky, A. M. (2008). Using calibration to improve rounding in imputation. *Am Statistician*, *62*(2), 125–129.

using rounding to impute binary variables using techniques for continuous data;uses the method to solve for the cutpoint for a continuous estimate to be converted into a binary value;method should be useful in more general situations;idea is to duplicate the entire dataset and in the second half of the new datasets to set all non-missing values of the target variable to missing;multiply impute these now-missing values and compare them to the actual values

```
```{r include=FALSE}
options(qproject='rms', prType='html')
require(Hmisc)
getRs('reptools.r')
getRs('qbookfun.r')
hookaddcap()
knitr::set_alias(w = 'fig.width', h = 'fig.height', cap = 'fig.cap', scap ='fig.scap')
```
# Missing Data {#sec-missing-data}
## Types of Missing Data
`r mrg(sound("missing-1"))`
* Missing completely at random (MCAR) `r ipacue()`
* Missing at random (MAR)^["Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models" @har07mul.]
* Informative missing (non-ignorable non-response)
See @car21mis, @sch02mis, @don06rev, @har07mul, @all01mis, @whi11mul, @buu12fle for an
introduction to missing data and imputation concepts.
## Prelude to Modeling
* Quantify extent of missing data `r ipacue()`
* Characterize types of subjects with missing data
* Find sets of variables that are missing on same subjects
## Missing Values for Different Types of Response Variables
* Serial data with subjects dropping out (not covered in this `r ipacue()`
course^[@twi13mul found instability in using multiple imputation of longitudinal data, and advantages of using instead full likelihood models.]
<!-- TODO twi13mul is epub--->
* $Y$=time to event, follow-up curtailed: covered under survival
analysis^[@whi09imp provide a method for multiply imputing missing covariate values using censored survival time data.]
* Often discard observations with completely missing $Y$ but sometimes wasteful^[$Y$ is so valuable that if one is only missing a $Y$ value, imputation is not worthwhile, and imputation of $Y$ is not advised if MCAR or MAR.]
* Characterize missings in $Y$ before dropping obs.
## Problems With Simple Alternatives to Imputation
Deletion of records---
* Badly biases parameter estimates when the probability of a `r ipacue()`
case being incomplete is related to $Y$ and not just
$X$ [@littlerubin].
* Deletion because of a subset of $X$ being missing
always results in inefficient estimates
* Deletion of records with missing $Y$ can result in
biases [@cra95com] but is the preferred approach
under MCAR^[Multiple imputation of $Y$ in that case does not improve the analysis and assumes the imputation model is correct.]
* @hip07reg found advantages to a "use
all variables to impute all variables then drop observations with
missing $Y$" approach (but see @sul15bia)
* @lee12rec suggest that observations missing
on both $Y$ and on a predictor of major interest are not helpful
* Only discard obs. when
+ MCAR can be justified
+ Rarely missing predictor of overriding importance that can't be
imputed from other data
+ Fraction of obs. with missings small and $n$ is large
* No advantage of deletion except savings of analyst time
* Making up missing data better than throwing away real data
* See @kno10unp
Adding extra categories of categorical predictors---
* Including missing data but adding a category "missing" causes `r ipacue()`
serious biases [@all01mis; @jon96ind; @vac98mis]
* Problem acute when values missing because subject too sick
* Difficult to interpret
* Fails even under MCAR [@jon96ind; @all01mis; @don06rev; @hei06imp; @kno10unp]
* May be OK if values are "missing" because of "not
applicable"^[E.g. you have a measure of marital happiness, dichotomized as high or low, but your sample contains some unmarried people. OK to have a 3-category variable with values high, low, and unmarried---Paul Allison, IMPUTE list, 4Jul09.]
Likewise, serious problems are caused by setting missing continuous
predictors to a constant (e.g., zero) and adding an indicator variable
to try to estimate the effect of missing values.
Two examples from @don06rev using binary logistic
regression, $N=500$.
Results of 1000 Simulations With $\beta_{1}=1.0$ Under MAR
and Two Types of Imputation `r ipacue()`

| Imputation Method | $\hat{\beta}_{1}$ | S.E. | Coverage of 0.90 C.I. |
|-----|-----|-----|-----|
| Single | 0.989 | 0.09 | 0.64 |
| Multiple | 0.989 | 0.14 | 0.90 |
Now consider a simulation with $\beta_{1}=1, \beta_{2}=0$, $X_{2}$
correlated with $X_{1}$ ($r=0.75$) but redundant in predicting $Y$, using a
missingness indicator when $X_{1}$ is MCAR in 0.4 of 500 subjects.
This is also compared with grand-mean fill-in imputation.
Results of 1000 Simulations Adding a Third Predictor Indicating Missing for $X_{1}$ `r ipacue()`

| Imputation Method | $\hat{\beta}_{1}$ | $\hat{\beta}_{2}$ |
|-----|-----|-----|
| Indicator | 0.55 | 0.51 |
| Overall mean | 0.55 | |
In the incomplete observations, the filled-in constant $X_{1}$ is uncorrelated
with $X_{2}$.
## Strategies for Developing an Imputation Model
**The goal of imputation is to preserve the information and meaning of the non-missing data.**
`r mrg(sound("missing-2"))`
There is a full Bayesian modeling alternative to all the methods
presented below. The Bayesian approach requires more effort but has
several advantages [@erl16dea].
Exactly how are missing values estimated?
* Could ignore all other information --- random or grand mean `r ipacue()`
fill-in
* Can use external info not used in response model (e.g., zip
code for income)
* Need to utilize reason for non-response if possible
* Use statistical model with sometimes-missing $X$ as response
variable
* Model to estimate the missing values should include all
variables that are either `r ipacue()`
1. related to the missing data mechanism;
1. have distributions that differ between subjects that have the
target variable missing and those that have it measured;
1. associated with the sometimes-missing variable when it is not
missing; or
1. included in the final response model [@bar04imp; @har07mul]
* Ignoring imputation results in biased $\hat{V}(\hat{\beta})$
* `transcan` function in Hmisc library: "optimal"
transformations of all variables to make residuals more stable and
to allow non-monotonic transformations
* `aregImpute` function in Hmisc: good approximation to full
Bayesian multiple imputation procedure using the bootstrap
* `transcan` and `aregImpute` use the following for fitting
imputation models: `r ipacue()`
1. initialize `NA`s to median (mode for categoricals)
1. expand all categorical predictors using dummy variables
1. expand all continuous predictors using restricted cubic splines
1. optionally optimally transform the variable being predicted by
expanding it with restricted cubic splines and using the first
canonical variate (multivariate regression) as the optimum
transformation (maximizing $R^2$)
1. one-dimensional scoring of categorical variables being
predicted using canonical variates on dummy variables representing
the categories (Fisher's optimum scoring algorithm); when imputing
categories, solve for which category yields a score that is
closest to the predicted score
* `aregImpute` and `transcan` work with `r ipacue()`
`fit.mult.impute` to make final analysis of response variable
relatively easy
* Predictive mean matching [@littlerubin]: replace missing
value with observed value of subject having closest predicted
value to the predicted value of the subject with the `NA`.
Key considerations are how to
1. model the target when it is not `NA`
1. match donors on predicted values
1. avoid overuse of "good" donors to disallow excessive ties in
imputed data
1. account for all uncertainties
* Predictive model for each target uses any outcomes, all
predictors in the final model other than the target, plus
auxiliary variables not in the outcome model
* No distributional assumptions; nicely handles target variables
with strange distributions [@vin14pre]
* Predicted values need only be monotonically related to real
predictive values
+ PMM can result in some donor observations being used repeatedly `r ipacue()`
+ Causes lumpy distribution of imputed values
+ Address by sampling from multinomial distribution,
probabilities = scaled distance of all predicted values to
predicted value ($y^{*}$) of observation needing imputing
+ Tukey's tricube function is a good weighting function (used in
loess): $w_{i} = (1 - \min(d_{i}/s, 1)^{3})^{3}$, <br>
$d_{i} = |\hat{y_{i}} - y^{*}|$ <br>
$s = 0.2\times\text{mean} |\hat{y_{i}} - y^{*}|$ is a good default
scale factor <br>
scale so that $\sum w_{i} = 1$
* Recursive partitioning with surrogate splits --- handles case
where a predictor of a variable needing imputation is missing
itself. But there are problems [@pen18pro] even with completely
random missingness.
* @whi11mul discusses an alternative method based on
choosing a donor observation at random from the $q$ closest matches
($q=3$, for example)
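The weighted predictive mean matching just described can be sketched as follows. This is an illustrative Python sketch with made-up donor values, not the `aregImpute` implementation: `donor_pred` holds predicted values for complete cases, `donor_obs` their observed target values, and `pred_star` is $y^{*}$, the prediction for the subject with the `NA`.

```python
import random

def pmm_impute(donor_pred, donor_obs, pred_star, rng=random.Random(0)):
    # distances d_i = |yhat_i - y*|
    d = [abs(p - pred_star) for p in donor_pred]
    # default scale factor s = 0.2 * mean |yhat_i - y*| (guard against 0)
    s = 0.2 * sum(d) / len(d) or 1e-12
    # Tukey tricube weights w_i = (1 - min(d_i/s, 1)^3)^3
    w = [(1 - min(di / s, 1) ** 3) ** 3 for di in d]
    total = sum(w)
    if total == 0:                       # all donors far away: fall back
        w = [1 / len(d)] * len(d)        # to uniform donor selection
    else:
        w = [wi / total for wi in w]     # scale so that sum(w_i) = 1
    # draw one donor from the multinomial; return its *observed* value
    return rng.choices(donor_obs, weights=w, k=1)[0]

imputed = pmm_impute([1.0, 2.0, 5.0, 9.0], [10, 20, 50, 90], pred_star=2.2)
print(imputed)   # -> 20: only the nearest donor gets tricube weight here
```

Because the imputed value is an actual observed donor value, no distributional assumption is needed for the target variable.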
### Interactions
* When interactions are in the outcome model, oddly enough it may `r ipacue()`
be better to treat interaction terms as "just another variable"
and do unconstrained imputation of them [@kim15eva]
## Single Conditional Mean Imputation
`r mrg(sound("missing-3"))`
* Can fill-in using unconditional mean or median if number of `r ipacue()`
missings low and $X$ is unrelated to other $X$s
* Otherwise, first approximation to good imputation uses other
$X$s to predict a missing $X$
* This is a single "best guess" conditional mean
* $\hat{X}_{j} = Z \hat{\theta}, Z = X_{\bar j}$ plus possibly
auxiliary variables that precede $X_{j}$ in the causal chain that
are not intended to be in the outcome model.<br>
Cannot include $Y$ in $Z$ without adding random errors to imputed
values as done with multiple imputation (would steal info from $Y$)
* Recursive partitioning can sometimes be helpful for nonparametrically
estimating conditional means
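A minimal sketch of the "best guess" conditional mean idea (illustrative Python with made-up numbers; in practice $Z$ would contain several predictors and splines, and $Y$ is deliberately excluded from $Z$):

```python
def ols(x, y):
    # slope and intercept for simple least-squares regression of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b            # intercept, slope

# complete cases: weight (kg) predicting height (cm) -- invented data
weight = [60, 70, 80, 90]
height = [160, 170, 180, 190]
a, b = ols(weight, height)

# subject with missing height: fill in the conditional mean, no random error
missing_weight = 75
imputed_height = a + b * missing_weight
print(imputed_height)                # -> 175.0
```

The absence of added random error is exactly why single conditional mean imputation understates variability; multiple imputation below repairs this.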
## Predictive Mean Matching
## Multiple Imputation
* Single imputation could use a random draw from the conditional `r ipacue()`
distribution for an individual <br>
$\hat{X}_{j} = Z \hat{\theta} + \hat{\epsilon}, Z = [X_{\bar j},
Y]$ plus auxiliary variables <br>
$\hat{\epsilon}$ a random draw from $N(0, \hat{\sigma}^{2})$ or from the
calculated residuals
+ bootstrap
+ approximate Bayesian bootstrap [@rub91mul; @har07mul]: sample
with replacement from sample with replacement of residuals
* Multiple imputations ($M$) with random draws
+ Draw sample of $M$ residuals for each missing value to be imputed
+ Average $M$ $\hat{\beta}$
+ In general can provide least biased estimates of $\beta$
+ Simple formula for imputation-corrected
var($\hat{\beta}$) <br> Function of average "apparent"
variances and between-imputation variances of
$\hat{\beta}$
+ Even when the $\chi^2$ distribution is a good approximation
when data have no missing values, the $t$ or $F$ distributions are
needed to have accurate $P$-values and confidence limits when
there are missings [@lip02deg; @rei07sma]
+ **BUT** full multiple imputation needs to account for
uncertainty in the imputation models by refitting these models for
each of the $M$ draws
+ `transcan` does not do that; `aregImpute` does
* Note that multiple imputation can and should use the response
variable for imputing predictors [@moo06usi]
* `aregImpute` algorithm [@moo06usi] `r ipacue()`
`r mrg(sound("missing-4"))`
+ Takes all aspects of uncertainty into account using the
bootstrap
+ Different bootstrap resamples used for each imputation by fitting
a flexible additive model on a sample with replacement
from the original data
+ This model is used to predict all of the original missing and
non-missing values for the target variable for the current imputation
+ Uses flexible parametric additive regression models to impute
+ There is an option to allow target variables to be optimally
transformed, even non-monotonically (but this can overfit)
+ By default uses predictive mean matching for imputation; no residuals
required (can also do more parametric regression imputation)
+ By default uses weighted PMM; many other matching options
+ Uses by default van Buuren's "Type 1" matching [@buu12fle, section
3.4.2] to capture the right amount of uncertainty by
computing predicted values for missing values using a
regression fit on the bootstrap sample, and finding donor
observations by matching those predictions to predictions from potential
donors using the regression fit from the original sample
of complete observations
+ When a predictor of the target variable is missing, it is first
imputed from its last imputation when it was a target variable
+ First 3 iterations of process are ignored ("burn-in")
+ Compares favorably to `R` `MICE` approach
+ Example:
```{r eval=FALSE}
a <- aregImpute(~ age + sex + bp + death + heart.attack.before.death,
data=mydata, n.impute=5)
f <- fit.mult.impute(death ~ rcs(age,3) + sex +
rcs(bp,5), lrm, a, data=mydata)
```
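The "simple formula for imputation-corrected variance" mentioned above is Rubin's rules: total variance $T = \bar{W} + (1 + 1/M)B$, where $\bar{W}$ is the average apparent (within-imputation) variance and $B$ the between-imputation variance of $\hat{\beta}$. Below is an illustrative Python sketch with made-up estimates, not the `fit.mult.impute` internals; the degrees-of-freedom line uses the classic large-sample approximation.

```python
import statistics as st

def rubin_pool(betas, variances):
    M = len(betas)
    beta_bar = st.mean(betas)            # pooled point estimate
    W = st.mean(variances)               # average "apparent" variance
    B = st.variance(betas)               # between-imputation variance
    T = W + (1 + 1 / M) * B              # imputation-corrected variance
    # classic Rubin degrees of freedom for t-based tests/intervals
    df = (M - 1) * (1 + W / ((1 + 1 / M) * B)) ** 2
    return beta_bar, T, df

betas     = [0.95, 1.02, 1.10, 0.98, 1.00]       # beta-hat from M=5 imputations
variances = [0.010, 0.011, 0.009, 0.010, 0.012]  # their apparent variances
beta, total_var, df = rubin_pool(betas, variances)
print(beta, total_var)   # total variance exceeds the naive average variance
```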
See @bar04imp for a nice review of multiple
imputation with detailed comparison of results
(point estimates and confidence limits for the effect of the
sometimes-missing predictor) for various imputation methods.
@bar06mul have a good overview of imputation
methods and a comparison of bias and confidence interval coverage for
the methods when applied to longitudinal data with a small number of subjects.
@hor07muc have a good review of several
software packages for dealing with missing data, and a comparison of
them with `aregImpute`. @har07mul provide a
nice overview of multiple imputation and discuss some of the
available software. @whi10bia studied bias
of multiple imputation vs. complete-case analysis.
@whi11mul provide much practical guidance.
**Caution**: Methods can generate imputations having very
reasonable distributions but still not having the property that final
response model regression coefficients have nominal confidence
interval coverage. It is worth checking that imputations generate the
correct collinearities among covariates.
* With `MICE` and `aregImpute` we are using the chained `r ipacue()`
equation approach [@whi11mul] `r ipacue()`
* Chained equations handles a wide variety of target variables to
be imputed and allows for multiple variables to be missing on the
same subject
* Iterative process cycles through all target
variables to impute all missing values [@buu06ful]
* Does not attempt to use the full Bayesian
multivariate model for all target variables, making it more
flexible and easy to use
* Possible to create improper imputations, e.g., imputing
conflicting values for different target variables
* However, simulation studies [@buu06ful] demonstrate
very good performance of imputation based on chained equations
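The cycling idea behind chained equations can be seen in a toy sketch (illustrative Python; real chained-equations software such as `MICE` or `aregImpute` uses flexible models and incorporates random draws, which this deterministic sketch omits):

```python
def mean(v):
    return sum(v) / len(v)

def slope_intercept(u, v):        # simple OLS of v on u
    mu, mv = mean(u), mean(v)
    b = sum((a - mu) * (c - mv) for a, c in zip(u, v)) / \
        sum((a - mu) ** 2 for a in u)
    return mv - b * mu, b

# invented data with y = 10x underneath; each variable missing on one subject
x = [1.0, 2.0, None, 4.0]
y = [10.0, None, 30.0, 40.0]
mx = [i for i, v in enumerate(x) if v is None]
my = [i for i, v in enumerate(y) if v is None]

# initialize NAs crudely (mean of non-missing values)
for i in mx: x[i] = mean([v for v in x if v is not None])
for i in my: y[i] = mean([v for v in y if v is not None])

# cycle through target variables, re-imputing each from the current
# completed values of the other, until the imputations stabilize
for _ in range(5):
    a, b = slope_intercept(y, x)
    for i in mx: x[i] = a + b * y[i]
    a, b = slope_intercept(x, y)
    for i in my: y[i] = a + b * x[i]

print(x[2], y[1])   # imputations approach 3 and 20, consistent with y = 10x
```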
## Diagnostics
`r mrg(sound("missing-5"))`
* MCAR can be partially assessed by comparing distribution of `r ipacue()`
non-missing $Y$ for those subjects with complete $X$ vs. those
subjects having incomplete $X$ [@littlerubin]
* Yucel and Zaslavsky [@yuc08usi; see also @he12dia]
* Interested in reasonableness of imputed values for a
sometimes-missing predictor $X_{j}$
* Duplicate entire dataset
* In the duplicated observations set all non-missing values of
$X_{j}$ to missing; let $w$ denote this set of observations set to missing
* Develop imputed values for the missing values of $X_{j}$
* In the observations in $w$ compare the
distribution of imputed $X_{j}$ to the original values of $X_{j}$
* @bon16gra present a variety of
useful diagnostics on the reasonableness of imputed values.
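The duplicate-and-compare diagnostic can be sketched as follows (illustrative Python; the data and the "imputation model" are stand-ins for whatever model produced the real imputations):

```python
import statistics as st

# invented original data in which bp is fully observed
age = [40, 50, 60, 70]
bp  = [120, 130, 140, 150]

def impute_bp(a):
    # stand-in imputation model: predict bp from age
    return 80 + 1.0 * a

# duplicated half of the dataset: treat every observed bp as missing
# and impute it with the same model used for the real NAs
imputed = [impute_bp(a) for a in age]

# diagnostic: compare distribution of imputed vs. actual values
print(st.mean(imputed), st.mean(bp))     # similar means -> reassuring
print(st.stdev(imputed), st.stdev(bp))   # similar spread -> reassuring
```

Large discrepancies between the two distributions would flag a poorly specified imputation model.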
<img src="missing-diagnostic.png" width="60%">
## Summary and Rough Guidelines
`r mrg(sound("missing-6"))` `r ipacue()`
| Method: | Deletion | Single | Multiple |
|-----------------------------------------|---------------|---|---|
| Allows non-random missing | | x | x |
| Reduces sample size | x | | |
| Apparent S.E. of $\hat{\beta}$ too low | | x | |
| Increases real S.E. of $\hat{\beta}$ | x | | |
| $\hat{\beta}$ biased | if not MCAR | x | |
: Summary of methods for dealing with missing values {#tbl-na-meth-summary}
The following contains crude guidelines. Simulation studies are
needed to refine the recommendations. Here $f$ refers to
the proportion of observations having _any_ variables missing.
* **$f < 0.03$:** `r ipacue()`
It doesn't matter very much how you impute missings or whether
you adjust variance of regression coefficient estimates for
having imputed data in this case. For
continuous variables imputing missings with the median
non-missing value is adequate; for categorical predictors the
most frequent category can be used. Complete case analysis is
also an option here. Multiple imputation may be needed to check
that the simple approach "worked."
* **$f \geq 0.03$:**
Use multiple imputation with number of
imputations^[@whi11mul recommend choosing $M$ so that the key inferential statistics are very reproducible should the imputation analysis be repeated. They suggest the use of $100f$ imputations. See also [@buu12fle, section 2.7]. @hip16num finds that the number of imputations should be quadratically increasing with the fraction of missing information.]
equal to $\max(5, 100f)$. Fewer imputations may be possible with
very large sample
sizes. See [statisticalhorizons.com/how-many-imputations](https://statisticalhorizons.com/how-many-imputations).
Type 1 predictive mean matching is usually preferred, with
weighted selection of donors. Account for imputation in estimating
the covariance matrix for final parameter estimates. Use the $t$
distribution instead of the Gaussian distribution for tests and
confidence intervals, if possible, using the estimated d.f. for the
parameter estimates.
* **Multiple predictors frequently missing:** `r ipacue()`
More imputations may be required. Perform a "sensitivity to
order" analysis by creating multiple imputations
using different orderings of sometimes missing variables. It may be
beneficial to initially sort variables so that the one with the most
`NA`s will be imputed first.
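The $M = \max(5, 100f)$ guideline above amounts to a one-liner (illustrative Python):

```python
from math import ceil

def n_imputations(f):
    # f = proportion of observations with any variables missing
    return max(5, ceil(100 * f))

print(n_imputations(0.02))   # -> 5  (tiny fraction missing)
print(n_imputations(0.25))   # -> 25 (a quarter of observations incomplete)
```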
Reason for missings more important than number of missing values.
Extreme amount of missing data does not prevent one from using
multiple imputation, because alternatives are worse [@jan10mis; @mad19pro].
### Effective Sample Size
It is useful to look at examples of effective sample sizes in the presence of missing data. If a sample of 1000 subjects contains various amounts and patterns of missings, what size $n_c$ of a complete sample would have equivalent information for the intended purpose of the analysis?
1. A new marker was collected on a random sample of 200 of the subjects and one wants to estimate the added predictive value due to the marker: $n_{c}=200$ `r ipacue()`
1. Height is missing on 100 subjects but we want to study association between BMI and outcome. Weight, sex, and waist circumference are available on all subjects: $n_{c}=980$
1. Each of 10 predictors is randomly missing on $\frac{1}{10}$ of subjects, and the predictors are uncorrelated with each other and are each weakly related to the outcome: $n_{c}=500$
1. Same as previous but the predictors can somewhat be predicted from non-missing predictors: $n_{c}=750$
1. The outcome variable was not assessed on a random $\frac{1}{5}$ of subjects: $n_{c}=800$
1. The outcome represents sensitive information, is missing on $\frac{1}{2}$ of subjects, and we don't know what made subjects respond to the question: $n_{c}=0$ (serious selection bias)
1. One of the baseline variables was collected prospectively $\frac{1}{2}$ of the time and for the other subjects it was retrospectively estimated only for subjects ultimately suffering a stroke and we don't know which subjects had a stroke: $n_{c}=0$ (study not worth doing)
1. The outcome variable was assessed by emailing the 1000 subjects, for which 800 responded, and we don't know what made subjects respond: $n_{c}=0$ (model will possibly be very biased---at least the intercept)
## Bayesian Methods for Missing Data
* Multiple imputation developed as an approximation to a full `r ipacue()`
Bayesian model
* Full Bayesian model treats missings as unknown parameters and
provides exact inference and correct measures of uncertainty
* See [this case study](https://github.com/paul-buerkner/brms/blob/master/vignettes/brms_missings.Rmd) for an example
* The case study also shows how to do "posterior stacking" if
you want to avoid having to specify a full model for missings, and
instead use usual multiple imputations as described in this chapter
+ Run a multiple imputation algorithm
+ For each completed dataset run the Bayesian analysis and draw
thousands of samples from the posterior distribution of the parameters
+ Pool all these posterior draws over all the multiple
imputations and do posterior inference as usual with no special
correction required
+ Made easy by the `Hmisc` package `aregImpute` function
and the `rms` `stackMI` function as demonstrated in the
Titanic case study later in the notes.
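The pooling step in posterior stacking is simply concatenating draws; a sketch (illustrative Python with stand-in "posterior draws" rather than output from a real sampler such as the one behind `stackMI`):

```python
import random

rng = random.Random(1)
M = 5
# stand-in posteriors: one set of draws per completed (imputed) dataset;
# the shifting means mimic between-imputation variation
draws_per_imputation = [
    [rng.gauss(1.0 + 0.05 * m, 0.2) for _ in range(1000)] for m in range(M)
]

# pool all draws across imputations -- no special correction needed
stacked = [d for draws in draws_per_imputation for d in draws]
stacked.sort()
n = len(stacked)
ci = (stacked[int(0.025 * n)], stacked[int(0.975 * n)])  # 0.95 credible interval
print(len(stacked), ci)   # 5000 pooled draws; the interval reflects both
                          # within- and between-imputation uncertainty
```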