From: Tim Hesterberg
To: Frank Harrell
Subject: bias of imputation
Date: Tuesday, May 26, 1998 6:27 PM

Frank,

You are using single imputation in Hmisc.  I believe this gives biased
results, not only for standard errors but also for e.g. regression
coefficients.

I'll give two examples, for the cases where missing values in the
explanatory variable(s) are imputed by regression or by the mean for
the variable.  In both examples there are no missing values in the
response variable Y, and the occurrence of missing values in the
explanatory variables is completely at random (independent of both Y
and the (possibly unobserved) values of the explanatory variables).
I'll assume for simplicity that all variables have mean 0.

First, imputation by regression.  Suppose there is a single explanatory
variable X, that Var(Y) = Var(X) = 1 and Cov(X,Y) = .4, and that 20% of
the X values are missing.  The true beta is .4.  The expected value of
X given Y is .4Y.  After replacing missing X's with .4Y,
   Var(X)   = .8 + .2*Var(.4Y) = .8 + .2*.16 = .832
   Cov(X,Y) = .8*.4 + .2*.4 = .4
   beta(using imputation) = .4/.832 = .481, not .4.

Second, mean imputation.  Suppose there are two explanatory variables,
X1 and X2, and that 20% of the values in X1 are missing, and none in Y
or X2.  Suppose that Var(Y) = Var(X1) = Var(X2) = 1, and that
Cov(Y,X1) = .2, Cov(Y,X2) = .3, Cov(X1,X2) = .4.  These values are for
illustration; there is nothing magic about them.  The true vector beta
is M^{-1} v = (.095, .262), where v = (.2, .3) and M is the 2x2 matrix
matrix(c(1, .4, .4, 1), 2).  With missing values in X1 filled in by the
mean of X1,
   Var(X1)    = .8
   Cov(Y,X1)  = .8*.2
   Cov(X1,X2) = .8*.4.
The resulting beta is M_2^{-1} v_2 = (.092, .271), where
v_2 = c(.16, .3) and M_2 = matrix(c(.8, .32, .32, 1), 2).  Note that
the resulting beta differs from the true value.

In both cases the bias does not disappear as the sample size increases,
if the fraction of missing values remains constant.

Tim

==========================================================
| Tim Hesterberg       Research Scientist                |
| timh@statsci.com     MathSoft/Statistical Sciences     |
| (206)283-8802x319    1700 Westlake Ave. N, Suite 500   |
| (206)283-0347 (fax)  Seattle, WA 98109-3044, U.S.A.    |
==========================================================
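Both examples above can be checked by simulation.  A minimal sketch
(S-Plus; the sample size, seed, and variable names are arbitrary
choices for illustration, not part of the original calculation):

  # First example: impute missing X by regression on Y, E(X|Y) = .4Y
  set.seed(1)
  n <- 100000
  X <- rnorm(n)
  Y <- .4*X + sqrt(1 - .4^2)*rnorm(n)   # Var(Y) = 1, Cov(X,Y) = .4
  w <- runif(n) < .2                    # 20% missing completely at random
  Xi <- X
  Xi[w] <- .4*Y[w]                      # regression imputation
  var(Xi)                               # approx .832
  coef(lm(Y ~ Xi))[2]                   # approx .4/.832 = .481, not .4

  # Second example: fill in missing X1 with the mean of the observed X1
  S <- matrix(c(1,.2,.3, .2,1,.4, .3,.4,1), 3)   # Cov of (Y, X1, X2)
  Z <- matrix(rnorm(3*n), n) %*% chol(S)
  Y <- Z[,1]; X1 <- Z[,2]; X2 <- Z[,3]
  w <- runif(n) < .2
  X1i <- X1
  X1i[w] <- mean(X1[!w])                # mean imputation
  coef(lm(Y ~ X1i + X2))[2:3]           # approx (.092, .271); true beta (.095, .262)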
Frank,

>I'll be getting back to you soon regarding your SBIR E-mail.  I need to
>also study your very useful notes below.  One initial question: how does
>multiple imputation fare in these situations?  -Frank

The multiple imputation we've done is model-based, e.g. that the joint
distribution of the variables is multivariate normal.  If the model is
correct, all covariances are unbiased and the \hat\beta's are
consistent.

>P.S. I push single imputation as a simple alternative to the worst
>approach (throwing away observations) and the next worst (substituting
>a constant such as the median).  It's a bridge to the best solution.
>Right now I have enough trouble getting imputation accepted at all
>by researchers who are used to complete-case analysis.

That's an interesting point.  In our multiple imputation software we
require that there be at least two imputations.  Even if you were to
bypass the checks for this to create an object with only one
imputation, when other objects were defined based on that object the
new objects would have only one imputation, and would be converted
automatically to an ordinary object with no imputations.  We don't
support single imputation.  I'll bring this up for discussion, but
suspect we won't change it, both for statistical and implementation
reasons.

Tim

>>>I need to
>>>also study your very useful notes below.  One initial question: how does
>>>multiple imputation fare in these situations?  -Frank
>>
>>The multiple imputation we've done is model-based, e.g. that
>>the joint distribution of the variables is multivariate normal.
>>If the model is correct all covariances are unbiased and the
>>\hat\beta's are consistent.
>
>What I remain confused about is that if you are doing a full
>model-based analysis, assuming multivariate normality for
>the predictors, why do you need to do multiple imputation --
>or, for that matter, why do you do any imputation instead of just
>solving the (messy) likelihood equations?  Is the multiple
>imputation really a tool to obtain certain numerical integrals,
>etc. via a simulation technique?  If it is, there may be a better
>label than 'multiple imputation' for your general approach.
>
>Thanks for educating me,

In the full model-based analysis you could just solve the likelihood
equations; in that case you may view multiple imputation as an
approximation.  There are other cases where the model used to create
the imputations may not be correct, but you use the imputations
anyway.  If the fraction of missing observations is small, the fact
that the distribution of the random multiple imputations isn't quite
right won't matter much.

If you have some time, I recommend Joe Schafer's book, Analysis of
Incomplete Multivariate Data, published by Chapman and Hall (1997).

>So my take on the situation is: (1) multiple imputation as defined by
>most statisticians means filling-in NAs multiple times, obtaining
>beta hat by the naive formula, and averaging the beta hats.

Yes.

>(2) multiple imputation, with the above nomenclature,
>has the same bias as single imputation for estimating
>regression coefficients (E(beta hat from multiple) has to equal E(beta hat from
>single) as the former estimate is a simple average of the latter).  I asked
>Rod Little once whether the only reason people use multiple imputation is
>to get variances using the multiple imp. formula and he said yes.  I.e.,
>if you use the bootstrap you can impute expected values (once) instead
>of random realizations of E(X2 | X1).

No.  But let me qualify that.

If you use single imputation, with the imputed values generated from
the same distribution that would be used for multiple imputation, then
it is true that
   E(beta hat from multiple) = E(beta hat from single RANDOM).
Some bias may arise because the random distribution for the
imputations may not be quite right, but this bias tends to be minor.
Using the bootstrap with a single random imputation for each missing
value for each bootstrap sample would be proper.

But if you use single imputation by replacing each missing value with
the average for that variable, or with its expected value given the
other variables, then the bias can be substantial.

Tim
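A minimal sketch of the procedure being discussed -- fill in the NAs
several times with random draws, fit the naive model each time, and
average the beta hats.  For simplicity this assumes the bivariate
normal setup of the earlier examples and treats the parameters of the
imputation distribution as known; proper multiple imputation would
also draw them from their posterior, as discussed later in this
exchange:

  set.seed(2)
  n <- 1000; rho <- .4; m <- 10
  X <- rnorm(n)
  Y <- rho*X + sqrt(1 - rho^2)*rnorm(n)
  X[runif(n) < .2] <- NA                # 20% missing completely at random
  w <- is.na(X)
  betas <- numeric(m)
  for(i in 1:m) {
    Xi <- X
    Xi[w] <- rho*Y[w] + sqrt(1 - rho^2)*rnorm(sum(w))  # random draw from X | Y
    betas[i] <- coef(lm(Y ~ Xi))[2]
  }
  mean(betas)                           # near the true slope rho = .4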
Frank,

Frank wrote:
>The bias of imputing using E(X2 | X1) seems to be at odds with what
>Werner Vach (Netherlands) told me yesterday, at least for logistic
>models.  Werner wrote a book on missing data imputation for binary
>logistic models -- he is quite expert in this area.  Here's what he
>said -- comments welcome!  Werner wrote a nice chapter on missing data
>in epidemiologic modeling in the recently printed Encyclopedia of
>Biostatistics.  I need to read it.

I won't claim to be as expert as Werner, but that won't stop me from
giving (uninformed but loud :-)) comments.

>Werner Vach wrote:
>>4) If someone asked me at the moment how to handle incomplete
>>covariate data in logistic regression models, my answer is: imputation
>>of conditional expectations (or probabilities).  I believe that this
>>is currently the best choice, because it is relatively simple and
>>yields nearly unbiased estimates.

The estimates using mean imputation may be nearly unbiased if the
fraction of missing information is small.  There's an ambiguity here
too, to which I'll return below.

>>And to my knowledge, it yields also
>>very often correct variance-covariance estimates (and so far as I know
>>this is also the experience of Michael Schemper).  The theoretical
>>reason for this is probably that in a logistic model correct
>>specification of the mean implies correct specification of the
>>variance, which makes it different from the Gaussian model.  The
>>increase of variance due to using estimated covariates instead of true
>>covariate values seems to be surprisingly small.  However, this result
>>is more or less only corroborated by simulation studies for the
>>two-covariate case, and it might be different for more complex
>>situations.  Summarizing, I see little reason for estimating
>>variances/covariances using the bootstrap; so far we have no real
>>indication that the naive variances are wrong.

>Can you explain in simplest terms why imputing using E(X2|X1) would be
>more biased than imputing using predicted individual X2 | X1?

The latter consists of the former plus some random noise, which
reflects the uncertainty in X2 given X1.  Not using that noise
corresponds to pretending that X2 is known with higher precision than
it actually is.

But there's a larger issue here.  When doing imputations, you should
use neither E(X2|X1) nor the conditional distribution of X2 | X1, but
rather the conditional distribution of X2 | X1 and Y.

To see this, assume the multivariate Gaussian model is correct, that Y
and X2 are highly correlated (say R=.99), and that X1 is independent
of both X2 and Y.  I'll look at all three cases.

Case 1:  Imputing with E(X2|X1)

This corresponds to using mean imputation for X2.  Having done that,
we can basically ignore X1, because it is independent of the other
variables and its regression coefficient will be near zero.  The
scatterplot of Y vs X2 will show two distinct groups: (1) the
completely observed points fall on a diagonal line plus a bit of
noise; (2) the imputed points fall on a vertical line, with X2 set at
its observed mean.  If the values of X2 are Missing Completely at
Random, then the regression coefficients (slope and intercept) are
unbiased, but the standard errors are completely incorrect.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ runif(40) < .2 ] _ NA
  X2i _ X2
  X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )

But in the more common case that X2 is Missing at Random (i.e. whether
X2 is missing is independent of the value of X2 | Y), it may be that,
say, large values of X2 are more likely to be missing.  The regression
slope is still unbiased, but the intercept is biased.
  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )

My intuition is that in higher-dimensional problems, or where X1 is
not independent of everything else, all regression coefficients will
be biased.

Case 2:  Imputing with the conditional distribution of X2 | X1

This corresponds to using the unconditional distribution of X2.  Again
we ignore X1.  The scatterplot of Y vs X2 will show two distinct
groups: (1) the completely observed points fall on a diagonal line
plus a bit of noise; (2) the imputed points form an uncorrelated
cloud.  All regression coefficients are biased.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ rnorm(length(w))
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )

Case 3:  Imputing with the conditional distribution of X2 | X1,Y

Again we ignore X1.  The scatterplot of Y vs X2 shows only one group:
(1) the completely observed points fall on a diagonal line plus a bit
of noise; (2) so do the imputed points.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .99*Y[ w ] + sqrt(1-.99^2)*rnorm(length(w))
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit) )

Caveat: In practice you don't know the conditional distribution, so
the multiple imputations also incorporate uncertainty in the model
parameters.

Now I'll return to the "ambiguity" mentioned above.  Werner's
statement mentions "imputation of conditional expectations", but
doesn't indicate whether he means conditional on Y and the other X's,
or just on the other X's.  The former would correspond to case 4.

Case 4:  Imputing with the conditional mean of X2 | X1,Y

Again we ignore X1.  The scatterplot of Y vs X2 shows two groups: (1)
the completely observed points fall on a diagonal line plus a bit of
noise; (2) the imputed points fall exactly on the line.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .99*Y[ w ]
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit) )

Note that the line for the (2) points is X2 = E[X2 | Y], not
Y = E[Y | X2].  This would be more noticeable with lower correlation:

  X2 _ rnorm(200)
  Y _ .7*X2 + sqrt(1-.7^2)*rnorm(200)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .7*Y[ w ]
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit), col=3 )
  abline( lm(Y~X2, na.action=na.omit) )

Using this method gives biased coefficients.

>I forgot to mention one other thing Werner Vach said.  He felt that
>in many cases expected value imputation is better (less biased) than
>random imputation because it solves some of the measurement error
>problem with covariables.  For what it's worth  -Frank

I haven't thought about that.  My intuition would go the other way:
using mean imputation tends to overstate the precision with which
covariates are known, whereas if the variables are measured with error
then the precision is actually less than the usual models assume.

Tim
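To put a rough number on the bias in case 4, here is a sketch with the
missingness changed to completely-at-random (an assumption made here
so that the bias has a simple closed form; the case 4 code above uses
Y-dependent missingness).  With fraction f of the X2's imputed by the
conditional mean rho*Y, Var(X2i) = (1-f) + f*rho^2 while Cov(X2i,Y) is
still rho, so the fitted slope is rho/((1-f) + f*rho^2), biased away
from the true rho:

  set.seed(4)
  n <- 100000; rho <- .7; f <- .5
  X2 <- rnorm(n)
  Y <- rho*X2 + sqrt(1 - rho^2)*rnorm(n)
  w <- runif(n) < f                     # fraction f missing completely at random
  X2i <- X2
  X2i[w] <- rho*Y[w]                    # conditional mean E(X2 | Y)
  coef(lm(Y ~ X2i))[2]                  # approx .94 = .7/.745, versus true slope .7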
>This material is enormously helpful to me.  Thanks so much for taking
>the trouble.

You're welcome.

>I have remaining two questions.
>
>1. How does imputation of expected values work when you condition on
>X1 and Y?
>
>2. I have trouble getting medical researchers to accept any sort of
>imputation.  If I tell them that the imputation used Y and possibly
>stole some information from Y, leading to a little circularity in the
>model (possibly elevated chi-squares for testing association with Y),
>they will probably flip out.  That's why I've followed a conservative
>approach of conditioning only on the other X's [However I could make Y
>available to transcan].  What's your answer to that concern?

Answer to 1: don't do it (imputation of expected values conditioned on
Y).  If you do, then you will have an elevated estimate of
association, be it R^2 or chi-square.  The flipped-out researchers
would be right.  Here's a variation on case 4 that demonstrates this:

Case 4:  Imputing with the conditional mean of X2 | X1,Y

Again we ignore X1.  The scatterplot of Y vs X2 shows two groups: (1)
the completely observed points fall on a diagonal line plus a bit of
noise; (2) the imputed points fall exactly on the line.

  set.seed(2)
  X2 _ rnorm(200)
  Y _ .7*X2 + sqrt(1-.7^2)*rnorm(200)
  cor(X2, Y)    # correlation .701
  X2[ 1:100 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .7*Y[ w ]
  cor(X2i, Y)   # correlation .835 -- too high
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit), col=3 )
  abline( lm(Y~X2, na.action=na.omit) )

Answer to 2: Graphics might help, e.g. the second example in case 1,
or case 2, to show what goes wrong when you don't condition on Y.
Emphasize that you are using the conditional distribution of X2 given
Y and the other variables, not the expected value, and show them case
3, where things do look reasonable.

You may want to find another example for case 1, one in which the
regression slope (not only the intercept) is biased.  This may involve
some correlation between X1 and X2.  Here's one example that keeps X1
and X2 independent, but for which the value of Y affects whether X2 is
missing:

  rho _ .8; n _ 200
  X2 _ rnorm(n)
  Y _ rho*X2 + sqrt(1-rho^2)*rnorm(n)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )
  abline(0, rho, col=2)   # the true line Y = rho*X2

I'll take a stab at an explanation, too.  First, assume that there is
some relationship between X2 and Y, and in particular some
relationship which is not explained by the other variables.  This is,
after all, why we would want to use variable X2 when predicting Y.
But conversely, we would also want to use that information when
predicting X2 (using Y and the other variables).  To not use that
information is akin to denying that there is a relationship between X2
and Y, at least for the observations for which the value of X2 is
unobserved.  To impute X2 without using Y would be to create a data
set in which the relationship between X2 and Y was artificially
weakened.

Now, "you" may be concerned that predicting X2 using Y, then turning
around and predicting Y using X2, would be circular, and lead to
overly high estimates of association (chi-square or R^2).  You would
be right if we replaced the missing values of X2 with their expected
values given Y.  But actually we replace X2 with its expected value
plus an appropriate amount of random variation.  Recall that R^2 has
the interpretation of "percent of variation explained"; we're adding
in both the right amount of explained variation and the right amount
of unexplained variation, in order to get the overall association
correct.

Tim
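A minimal sketch of that last point, reusing the setup of the
variation on case 4 above but imputing with a random draw from the
conditional distribution of X2 | Y rather than the conditional mean.
The correlation comes out near the true .7 instead of being inflated
toward .835:

  set.seed(2)
  X2 <- rnorm(200)
  Y <- .7*X2 + sqrt(1 - .7^2)*rnorm(200)
  X2[ 1:100 ] <- NA
  X2i <- X2
  w <- is.na(X2)
  X2i[w] <- .7*Y[w] + sqrt(1 - .7^2)*rnorm(sum(w))   # draw from X2 | Y
  cor(X2i, Y)   # near .7 -- the right amount of association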
>Again, this is an enormous help to my understanding.
>With your permission I would like to update the currently too short
>section in my book about imputation with extracts of several of your
>nice messages, with attributions of course.

Thanks, that's very flattering.  Go ahead.  I'd like to review that,
if possible, but a review is not mandatory (especially if you're
fighting a deadline at some point).

I should note that there are some details I've omitted.  Let's look at
the simplest case, where there is only one variable Y, with some
observations missing completely at random, and we want to impute the
missing values.  Suppose that Y is normal.  If mu and sigma are known,
one would impute the missing values from N(mu, sigma).  More likely,
mu and sigma are not known.  One might estimate them in the usual way,
then impute from N(xbar, s).  But that would tend to result in
variability that is too small.  Instead, in the more rigorous random
imputation done in Schafer's book and the forthcoming S-Plus software,
one would generate random mu and sigma from their posterior
distribution under an appropriate model, then use them in generating
random Y's.  Under the non-informative prior,
   sigma ~ s / sqrt( (chi-square with n-1 degrees of freedom) / (n-1) )
   mu ~ N(xbar, sigma^2/n).
(Don't quote me on those; I'd need to check that I'm not missing a
constant somewhere.)

In other words, we can't generate random values from their conditional
distribution (which is unknown), but rather from their conditional
distribution given values of the unknown parameters, which are in turn
generated from the conditional distribution of the parameters given
the observed data.  For multivariate problems, generating the random
covariance matrix involves the inverse of a random Wishart variate.

For you to do everything in this more rigorous way would be a very
large effort.  We're involved in such an effort here.  I suggest that
you not do so in your software, but make appropriate notes in the
documentation, refer to Joe Schafer's code on statlib, and stick in
some advertising for the forthcoming S+MissingData Module :-).  As
long as the fraction of missing data is small, the final results
wouldn't be too different.

>You've got me convinced that I need to make the default
>imputation method be random draws and include Y
>as a predictor of X_i.  I'm going to make the transcan function
>do this by back-solving on the original scale from the transformed
>normal(0,1) scale.

That sounds reasonable.

>For categorical predictors I may add an
>option to do random draws from a multinomial distribution
>arising from a classification tree (from tree()).

That, or from the multinomial distribution obtained from a log-linear
model; but then the software or the user has to specify which
interactions are included.

Regards,
Tim
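A sketch of the univariate "proper" imputation step described in the
message above, under the non-informative prior: draw sigma and then mu
from their posterior, then draw the missing values.  The function name
is hypothetical (made up for illustration), and Tim's own caveat
applies -- check the constants before relying on this:

  # Hypothetical helper, not from any package: one proper-imputation
  # draw for a single normal variable with values missing at random.
  proper.impute <- function(y) {
    obs <- y[!is.na(y)]
    n <- length(obs)
    # sigma = s / sqrt( chi-square_{n-1} / (n-1) ), equivalently
    # sigma^2 = sum((y_i - ybar)^2) / chi-square_{n-1}
    sigma <- sqrt(sum((obs - mean(obs))^2) / rchisq(1, n - 1))
    # mu | sigma ~ N(ybar, sigma^2/n)
    mu <- rnorm(1, mean(obs), sigma/sqrt(n))
    y[is.na(y)] <- rnorm(sum(is.na(y)), mu, sigma)
    y
  }

Each call gives one imputed data set with fresh draws of mu and sigma;
repeating the call m times gives the multiple imputations, restoring
the variability that imputing from N(xbar, s) would understate.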