From: Tim Hesterberg
To: Frank Harrell
Subject: bias of imputation
Date: Tuesday, May 26, 1998 6:27 PM

Frank,

You are using single imputation in Hmisc.  I believe this gives biased
results, not only for standard errors but also for e.g. regression
coefficients.

I'll give two examples, for the cases where missing values in the
explanatory variable(s) are imputed by regression or by the mean for
the variable.  In both examples there are no missing values in the
response variable Y, and the occurrence of missing values in the
explanatory variables is completely at random (independent of both Y
and the (possibly unobserved) values of the explanatory variables).
I'll assume for simplicity that all variables have mean 0.

First, imputation by regression.  Suppose there is a single explanatory
variable X, that Var(Y) = Var(X) = 1 and Cov(X,Y) = .4, and that 20% of
the X values are missing.  The true beta is .4.  The expected value of
X given Y is .4Y.  After replacing missing X's with .4Y,
   Var(X)   = .8 + .2*Var(.4Y) = .8 + .2*.16 = .832
   Cov(X,Y) = .8*.4 + .2*.4 = .4
   beta(using imputation) = .4/.832 = .481, not .4.

Second, mean imputation.  Suppose there are two explanatory variables,
X1 and X2, and that 20% of the values in X1 are missing, and none in Y
or X2.  Suppose that Var(Y) = Var(X1) = Var(X2) = 1, and that
Cov(Y,X1) = .2, Cov(Y,X2) = .3, Cov(X1,X2) = .4.  These values are for
illustration; there is nothing magic about them.  The true vector beta
is M^{-1} v = (.095, .262), where v = (.2, .3) and M is the 2x2 matrix
matrix(c(1, .4, .4, 1), 2).  With missing values in X1 filled in by the
mean of X1,
   Var(X1)    = .8
   Cov(Y,X1)  = .8*.2
   Cov(X1,X2) = .8*.4.
The resulting beta is M_2^{-1} v_2 = (.092, .271), where
v_2 = c(.16, .3) and M_2 = matrix(c(.8, .32, .32, 1), 2).  Note that
the resulting beta differs from the true value.

In both cases the bias does not disappear as the sample size increases,
if the fraction of missing values remains constant.

Tim

==========================================================
| Tim Hesterberg       Research Scientist                |
| timh@statsci.com     MathSoft/Statistical Sciences     |
| (206)283-8802x319    1700 Westlake Ave. N, Suite 500   |
| (206)283-0347 (fax)  Seattle, WA 98109-3044, U.S.A.    |
==========================================================
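Both examples above can be checked by simulation.  A minimal sketch
(S-Plus; the sample size, seed, and variable names are arbitrary
choices for illustration, not part of the original calculation):

  # First example: impute missing X by regression on Y, E(X|Y) = .4Y
  set.seed(1)
  n <- 100000
  X <- rnorm(n)
  Y <- .4*X + sqrt(1 - .4^2)*rnorm(n)   # Var(Y) = 1, Cov(X,Y) = .4
  w <- runif(n) < .2                    # 20% missing completely at random
  Xi <- X
  Xi[w] <- .4*Y[w]                      # regression imputation
  var(Xi)                               # approx .832
  coef(lm(Y ~ Xi))[2]                   # approx .4/.832 = .481, not .4

  # Second example: fill in missing X1 with the mean of the observed X1
  S <- matrix(c(1,.2,.3, .2,1,.4, .3,.4,1), 3)   # Cov of (Y, X1, X2)
  Z <- matrix(rnorm(3*n), n) %*% chol(S)
  Y <- Z[,1]; X1 <- Z[,2]; X2 <- Z[,3]
  w <- runif(n) < .2
  X1i <- X1
  X1i[w] <- mean(X1[!w])                # mean imputation
  coef(lm(Y ~ X1i + X2))[2:3]           # approx (.092, .271); true beta (.095, .262)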
Frank,

>I'll be getting back to you soon regarding your SBIR E-mail.  I need to
>also study your very useful notes below.  One initial question: how does
>multiple imputation fare in these situations?  -Frank

The multiple imputation we've done is model-based, e.g. that the joint
distribution of the variables is multivariate normal.  If the model is
correct, all covariances are unbiased and the \hat\beta's are
consistent.

>P.S. I push single imputation as a simple alternative to the worst
>approach (throwing away observations) and the next worst (substituting
>a constant such as the median).  It's a bridge to the best solution.
>Right now I have enough trouble getting imputation accepted at all
>by researchers who are used to complete-case analysis.

That's an interesting point.  In our multiple imputation software we
require that there be at least two imputations.  Even if you were to
bypass the checks for this to create an object with only one
imputation, when other objects were defined based on that object the
new objects would have only one imputation, and would be converted
automatically to an ordinary object with no imputations.  We don't
support single imputation.  I'll bring this up for discussion, but
suspect we won't change it, both for statistical and implementation
reasons.

Tim

>>>I need to
>>>also study your very useful notes below.  One initial question: how does
>>>multiple imputation fare in these situations?  -Frank
>>
>>The multiple imputation we've done is model-based, e.g. that
>>the joint distribution of the variables is multivariate normal.
>>If the model is correct all covariances are unbiased and the
>>\hat\beta's are consistent.
>
>What I remain confused about is that if you are doing a full
>model-based analysis, assuming multivariate normality for
>the predictors, why do you need to do multiple imputation --
>or, for that matter, why do you do any imputation instead of just
>solving the (messy) likelihood equations?  Is the multiple
>imputation really a tool to obtain certain numerical integrals,
>etc. via a simulation technique?  If it is, there may be a better
>label than 'multiple imputation' for your general approach.
>
>Thanks for educating me,

In the full model-based analysis you could just solve the likelihood
equations; in that case you may view multiple imputation as an
approximation.  There are other cases where the model used to create
the imputations may not be correct, but you use the imputations
anyway.  If the fraction of missing observations is small, the fact
that the distribution of the random multiple imputations isn't quite
right won't matter much.

If you have some time, I recommend Joe Schafer's book, Analysis of
Incomplete Multivariate Data, published by Chapman and Hall (1997).

>So my take on the situation is: (1) multiple imputation as defined by
>most statisticians means filling-in NAs multiple times, obtaining
>beta hat by the naive formula, and averaging the beta hats.

Yes.

>(2) multiple imputation, with the above nomenclature,
>has the same bias as single imputation for estimating
>regression coefficients (E(beta hat from multiple) has to equal E(beta hat from
>single) as the former estimate is a simple average of the latter).  I asked
>Rod Little once whether the only reason people use multiple imputation is
>to get variances using the multiple imp. formula and he said yes.  I.e.,
>if you use the bootstrap you can impute expected values (once) instead
>of random realizations of E(X2 | X1).

No.  But let me qualify that.

If you use single imputation, with the imputed values generated from
the same distribution that would be used for multiple imputation, then
it is true that
   E(beta hat from multiple) = E(beta hat from single RANDOM).
Some bias may arise because the random distribution for the
imputations may not be quite right, but this bias tends to be minor.
Using the bootstrap with a single random imputation for each missing
value for each bootstrap sample would be proper.

But if you use single imputation by replacing each missing value with
the average for that variable, or with its expected value given the
other variables, then the bias can be substantial.

Tim
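A minimal sketch of the procedure being discussed -- fill in the NAs
several times with random draws, fit the naive model each time, and
average the beta hats.  For simplicity this assumes the bivariate
normal setup of the earlier examples and treats the parameters of the
imputation distribution as known; proper multiple imputation would
also draw them from their posterior, as discussed later in this
exchange:

  set.seed(2)
  n <- 1000; rho <- .4; m <- 10
  X <- rnorm(n)
  Y <- rho*X + sqrt(1 - rho^2)*rnorm(n)
  X[runif(n) < .2] <- NA                # 20% missing completely at random
  w <- is.na(X)
  betas <- numeric(m)
  for(i in 1:m) {
    Xi <- X
    Xi[w] <- rho*Y[w] + sqrt(1 - rho^2)*rnorm(sum(w))  # random draw from X | Y
    betas[i] <- coef(lm(Y ~ Xi))[2]
  }
  mean(betas)                           # near the true slope rho = .4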
Frank,

Frank wrote:
>The bias of imputing using E(X2 | X1) seems to be at odds with what
>Werner Vach (Netherlands) told me yesterday, at least for logistic
>models.  Werner wrote a book on missing data imputation for binary
>logistic models -- he is quite expert in this area.  Here's what he
>said -- comments welcome!  Werner wrote a nice chapter on missing data
>in epidemiologic modeling in the recently printed Encyclopedia of
>Biostatistics.  I need to read it.

I won't claim to be as expert as Werner, but that won't stop me from
giving (uninformed but loud :-)) comments.

>Werner Vach wrote:
>>4) If someone asked me at the moment how to handle incomplete
>>covariate data in logistic regression models, my answer is: imputation
>>of conditional expectations (or probabilities).  I believe that this
>>is currently the best choice, because it is relatively simple and
>>yields nearly unbiased estimates.

The estimates using mean imputation may be nearly unbiased if the
fraction of missing information is small.  There's an ambiguity here
too, to which I'll return below.

>>And to my knowledge, it yields also
>>very often correct variance-covariance estimates (and so far as I know
>>this is also the experience of Michael Schemper).  The theoretical
>>reason for this is probably that in a logistic model correct
>>specification of the mean implies correct specification of the
>>variance, which makes it different from the Gaussian model.  The
>>increase of variance due to using estimated covariates instead of true
>>covariate values seems to be surprisingly small.  However, this result
>>is more or less only corroborated by simulation studies for the
>>two-covariate case, and it might be different for more complex
>>situations.  Summarizing, I see little reason for estimating
>>variances/covariances using the bootstrap; so far we have no real
>>indication that the naive variances are wrong.

>Can you explain in simplest terms why imputing using E(X2|X1) would be
>more biased than imputing using predicted individual X2 | X1?

The latter consists of the former plus some random noise, which
reflects the uncertainty in X2 given X1.  Not using that noise
corresponds to pretending that X2 is known with higher precision than
it actually is.

But there's a larger issue here.  When doing imputations, you should
use neither E(X2|X1) nor the conditional distribution of X2 | X1, but
rather the conditional distribution of X2 | X1 and Y.

To see this, assume the multivariate Gaussian model is correct, that Y
and X2 are highly correlated (say R=.99), and that X1 is independent
of both X2 and Y.  I'll look at all three cases.

Case 1:  Imputing with E(X2|X1)

This corresponds to using mean imputation for X2.  Having done that,
we can basically ignore X1, because it is independent of the other
variables and its regression coefficient will be near zero.  The
scatterplot of Y vs X2 will show two distinct groups: (1) the
completely observed points fall on a diagonal line plus a bit of
noise; (2) the imputed points fall on a vertical line, with X2 set at
its observed mean.  If the values of X2 are Missing Completely at
Random, then the regression coefficients (slope and intercept) are
unbiased, but the standard errors are completely incorrect.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ runif(40) < .2 ] _ NA
  X2i _ X2
  X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )

But in the more common case that X2 is Missing at Random (i.e. whether
X2 is missing is independent of the value of X2 | Y), it may be that,
say, large values of X2 are more likely to be missing.  The regression
slope is still unbiased, but the intercept is biased.
  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )

My intuition is that in higher-dimensional problems, or where X1 is
not independent of everything else, all regression coefficients will
be biased.

Case 2:  Imputing with the conditional distribution of X2 | X1

This corresponds to using the unconditional distribution of X2.  Again
we ignore X1.  The scatterplot of Y vs X2 will show two distinct
groups: (1) the completely observed points fall on a diagonal line
plus a bit of noise; (2) the imputed points form an uncorrelated
cloud.  All regression coefficients are biased.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ rnorm(length(w))
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )

Case 3:  Imputing with the conditional distribution of X2 | X1,Y

Again we ignore X1.  The scatterplot of Y vs X2 shows only one group:
(1) the completely observed points fall on a diagonal line plus a bit
of noise; (2) so do the imputed points.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .99*Y[ w ] + sqrt(1-.99^2)*rnorm(length(w))
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit) )

Caveat: In practice you don't know the conditional distribution, so
the multiple imputations also incorporate uncertainty in the model
parameters.

Now I'll return to the "ambiguity" mentioned above.  Werner's
statement mentions "imputation of conditional expectations", but
doesn't indicate whether he means conditional on Y and the other X's,
or just on the other X's.  The former would correspond to case 4.

Case 4:  Imputing with the conditional mean of X2 | X1,Y

Again we ignore X1.  The scatterplot of Y vs X2 shows two groups: (1)
the completely observed points fall on a diagonal line plus a bit of
noise; (2) the imputed points fall exactly on the line.

  X2 _ rnorm(40)
  Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .99*Y[ w ]
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit) )

Note that the line for the (2) points is X2 = E[X2 | Y], not
Y = E[Y | X2].  This would be more noticeable with lower correlation:

  X2 _ rnorm(200)
  Y _ .7*X2 + sqrt(1-.7^2)*rnorm(200)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .7*Y[ w ]
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit), col=3 )
  abline( lm(Y~X2, na.action=na.omit) )

Using this method gives biased coefficients.

>I forgot to mention one other thing Werner Vach said.  He felt that
>in many cases expected value imputation is better (less biased) than
>random imputation because it solves some of the measurement error
>problem with covariables.  For what it's worth  -Frank

I haven't thought about that.  My intuition would go the other way:
using mean imputation tends to overstate the precision with which
covariates are known, whereas if the variables are measured with error
then the precision is actually less than the usual models assume.

Tim
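To put a rough number on the bias in case 4, here is a sketch with the
missingness changed to completely-at-random (an assumption made here
so that the bias has a simple closed form; the case 4 code above uses
Y-dependent missingness).  With fraction f of the X2's imputed by the
conditional mean rho*Y, Var(X2i) = (1-f) + f*rho^2 while Cov(X2i,Y) is
still rho, so the fitted slope is rho/((1-f) + f*rho^2), biased away
from the true rho:

  set.seed(4)
  n <- 100000; rho <- .7; f <- .5
  X2 <- rnorm(n)
  Y <- rho*X2 + sqrt(1 - rho^2)*rnorm(n)
  w <- runif(n) < f                     # fraction f missing completely at random
  X2i <- X2
  X2i[w] <- rho*Y[w]                    # conditional mean E(X2 | Y)
  coef(lm(Y ~ X2i))[2]                  # approx .94 = .7/.745, versus true slope .7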
>This material is enormously helpful to me.  Thanks so much for taking
>the trouble.

You're welcome.

>I have remaining two questions.
>
>1. How does imputation of expected values work when you condition on
>X1 and Y?
>
>2. I have trouble getting medical researchers to accept any sort of
>imputation.  If I tell them that the imputation used Y and possibly
>stole some information from Y, leading to a little circularity in the
>model (possibly elevated chi-squares for testing association with Y),
>they will probably flip out.  That's why I've followed a conservative
>approach of conditioning only on the other X's [However I could make Y
>available to transcan].  What's your answer to that concern?

Answer to 1: don't do it (imputation of expected values conditioned on
Y).  If you do, then you will have an elevated estimate of
association, be it R^2 or chi-square.  The flipped-out researchers
would be right.  Here's a variation on case 4 that demonstrates this:

Case 4:  Imputing with the conditional mean of X2 | X1,Y

Again we ignore X1.  The scatterplot of Y vs X2 shows two groups: (1)
the completely observed points fall on a diagonal line plus a bit of
noise; (2) the imputed points fall exactly on the line.

  set.seed(2)
  X2 _ rnorm(200)
  Y _ .7*X2 + sqrt(1-.7^2)*rnorm(200)
  cor(X2, Y)    # correlation .701
  X2[ 1:100 ] _ NA
  X2i _ X2
  w _ which.na(X2)
  X2i[ w ] _ .7*Y[ w ]
  cor(X2i, Y)   # correlation .835 -- too high
  plot(X2i, Y)
  points(X2i[w], Y[w], col=2)
  abline( lm(Y~X2i, na.action=na.omit), col=3 )
  abline( lm(Y~X2, na.action=na.omit) )

Answer to 2: Graphics might help, e.g. the second example in case 1,
or case 2, to show what goes wrong when you don't condition on Y.
Emphasize that you are using the conditional distribution of X2 given
Y and the other variables, not the expected value, and show them case
3, where things do look reasonable.

You may want to find another example for case 1, one in which the
regression slope (not only the intercept) is biased.  This may involve
some correlation between X1 and X2.  Here's one example that keeps X1
and X2 independent, but for which the value of Y affects whether X2 is
missing:

  rho _ .8; n _ 200
  X2 _ rnorm(n)
  Y _ rho*X2 + sqrt(1-rho^2)*rnorm(n)
  X2[ Y > .5 ] _ NA
  X2i _ X2
  X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
  plot(X2i, Y)
  abline( lm(Y~X2i, na.action=na.omit) )
  abline(0, rho, col=2)   # the true line Y = rho*X2

I'll take a stab at an explanation, too.  First, assume that there is
some relationship between X2 and Y, and in particular some
relationship which is not explained by the other variables.  This is,
after all, why we would want to use variable X2 when predicting Y.
But conversely, we would also want to use that information when
predicting X2 (using Y and the other variables).  To not use that
information is akin to denying that there is a relationship between X2
and Y, at least for the observations for which the value of X2 is
unobserved.  To impute X2 without using Y would be to create a data
set in which the relationship between X2 and Y was artificially
weakened.

Now, "you" may be concerned that predicting X2 using Y, then turning
around and predicting Y using X2, would be circular, and lead to
overly high estimates of association (chi-square or R^2).  You would
be right if we replaced the missing values of X2 with their expected
values given Y.  But actually we replace X2 with its expected value
plus an appropriate amount of random variation.  Recall that R^2 has
the interpretation of "percent of variation explained"; we're adding
in both the right amount of explained variation and the right amount
of unexplained variation, in order to get the overall association
correct.

Tim
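A minimal sketch of that last point, reusing the setup of the
variation on case 4 above but imputing with a random draw from the
conditional distribution of X2 | Y rather than the conditional mean.
The correlation comes out near the true .7 instead of being inflated
toward .835:

  set.seed(2)
  X2 <- rnorm(200)
  Y <- .7*X2 + sqrt(1 - .7^2)*rnorm(200)
  X2[ 1:100 ] <- NA
  X2i <- X2
  w <- is.na(X2)
  X2i[w] <- .7*Y[w] + sqrt(1 - .7^2)*rnorm(sum(w))   # draw from X2 | Y
  cor(X2i, Y)   # near .7 -- the right amount of association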
>Again, this is an enormous help to my understanding.
>With your permission I would like to update the currently too short
>section in my book about imputation with extracts of several of your
>nice messages, with attributions of course.

Thanks, that's very flattering.  Go ahead.  I'd like to review that,
if possible, but a review is not mandatory (especially if you're
fighting a deadline at some point).

I should note that there are some details I've omitted.  Let's look at
the simplest case, where there is only one variable Y, with some
observations missing completely at random, and we want to impute the
missing values.  Suppose that Y is normal.  If mu and sigma are known,
one would impute the missing values from N(mu, sigma).  More likely,
mu and sigma are not known.  One might estimate them in the usual way,
then impute from N(xbar, s).  But that would tend to result in
variability that is too small.  Instead, in the more rigorous random
imputation done in Schafer's book and the forthcoming S-Plus software,
one would generate random mu and sigma from their posterior
distribution under an appropriate model, then use them in generating
random Y's.  Under the non-informative prior,
   sigma ~ s / sqrt( (chi-square with n-1 degrees of freedom) / (n-1) )
   mu ~ N(xbar, sigma^2/n).
(Don't quote me on those; I'd need to check that I'm not missing a
constant somewhere.)

In other words, we can't generate random values from their conditional
distribution (which is unknown), but rather from their conditional
distribution given values of the unknown parameters, which are in turn
generated from the conditional distribution of the parameters given
the observed data.  For multivariate problems, generating the random
covariance matrix involves the inverse of a random Wishart variate.

For you to do everything in this more rigorous way would be a very
large effort.  We're involved in such an effort here.  I suggest that
you not do so in your software, but make appropriate notes in the
documentation, refer to Joe Schafer's code on statlib, and stick in
some advertising for the forthcoming S+MissingData Module :-).  As
long as the fraction of missing data is small, the final results
wouldn't be too different.

>You've got me convinced that I need to make the default
>imputation method be random draws and include Y
>as a predictor of X_i.  I'm going to make the transcan function
>do this by back-solving on the original scale from the transformed
>normal(0,1) scale.

That sounds reasonable.

>For categorical predictors I may add an
>option to do random draws from a multinomial distribution
>arising from a classification tree (from tree()).

That, or from the multinomial distribution obtained from a log-linear
model; but then the software or the user has to specify which
interactions are included.

Regards,
Tim
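A sketch of the univariate "proper" imputation step described in the
message above, under the non-informative prior: draw sigma and then mu
from their posterior, then draw the missing values.  The function name
is hypothetical (made up for illustration), and Tim's own caveat
applies -- check the constants before relying on this:

  # Hypothetical helper, not from any package: one proper-imputation
  # draw for a single normal variable with values missing at random.
  proper.impute <- function(y) {
    obs <- y[!is.na(y)]
    n <- length(obs)
    # sigma = s / sqrt( chi-square_{n-1} / (n-1) ), equivalently
    # sigma^2 = sum((y_i - ybar)^2) / chi-square_{n-1}
    sigma <- sqrt(sum((obs - mean(obs))^2) / rchisq(1, n - 1))
    # mu | sigma ~ N(ybar, sigma^2/n)
    mu <- rnorm(1, mean(obs), sigma/sqrt(n))
    y[is.na(y)] <- rnorm(sum(is.na(y)), mu, sigma)
    y
  }

Each call gives one imputed data set with fresh draws of mu and sigma;
repeating the call m times gives the multiple imputations, restoring
the variability that imputing from N(xbar, s) would understate.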