[Multiple imputation of missing at random data: General points and presentation of a Monte-Carlo method]. Rev Epidemiol Sante Publique. 2009 Oct; 57(5):361-72.
Statistical analysis of data sets with missing values is a frequent problem in epidemiology. Methods are available to handle incomplete observations that avoid biased estimates and improve precision compared with more traditional approaches, such as analysing only the sub-sample of complete observations.
One of these approaches is multiple imputation, which consists of successively imputing several values for each missing data item. Several completed data sets with the same distributional characteristics as the observed data (variability and correlations) are thus generated. Standard analyses are run separately on each completed data set and then combined to obtain an overall result. In this paper, we discuss the various assumptions made about the origin of missing data (at random or not), and we present the multiple imputation process in a pragmatic way. A recent method, Multiple Imputation by Chained Equations (MICE), based on a Markov chain Monte Carlo algorithm under the missing at random (MAR) hypothesis, is described. An illustrative example of the MICE method is detailed for the analysis of the relationship between a dichotomous variable and two covariates presenting MAR data with no particular structure, using multivariate logistic regression.
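The combination step mentioned above is commonly carried out with Rubin's rules; the abstract does not say which pooling formula the authors used, so the following is only a minimal sketch under that assumption. The function name `pool_rubin` and the numbers in the usage lines are hypothetical and chosen for illustration, not taken from the paper.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m completed data sets (Rubin's rules).

    estimates: the coefficient estimated on each completed data set
    variances: the squared standard error of that coefficient in each data set
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)       # pooled point estimate
    within = mean(variances)      # average within-imputation variance
    between = variance(estimates) # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical log-odds estimates from five completed data sets.
estimate, total_var = pool_rubin(
    [0.50, 0.60, 0.55, 0.45, 0.65],
    [0.04, 0.04, 0.04, 0.04, 0.04],
)
print(estimate, total_var ** 0.5)  # pooled coefficient and its standard error
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analysing a single completed data set would ignore it and understate the standard error.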
Taking the original data set without missing data as the reference, the results show that the MICE method substantially improves the regression coefficient estimates relative to those obtained from the complete observations only.
This method does not require any direct assumption about the joint distribution of the variables, and it is currently implemented in standard statistical software (Splus, Stata). It can be used for multiple imputation of missing data in several variables with no particular structure.
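To make the chained-equations idea concrete, here is a deliberately simplified sketch, not the authors' implementation. It assumes two continuous variables and a linear regression for each conditional model, whereas the paper's example involves a dichotomous outcome and logistic regression. It also skips the step in which a full MICE procedure draws the regression parameters from their posterior distribution. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return a, b, sd


def chained_imputation(rows, n_cycles=10, rng=None):
    """Return one completed copy of rows (pairs of values, None = missing)."""
    rng = rng or random.Random()
    missing = [[value is None for value in row] for row in rows]
    data = [list(row) for row in rows]
    # Initialise every missing cell with the observed mean of its variable.
    for j in range(2):
        fill = mean(row[j] for row in rows if row[j] is not None)
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = fill
    # Cycle through the variables, re-imputing each one from the other.
    for _ in range(n_cycles):
        for j in range(2):
            k = 1 - j
            observed = [i for i in range(len(data)) if not missing[i][j]]
            a, b, sd = fit_line([data[i][k] for i in observed],
                                [data[i][j] for i in observed])
            for i, row in enumerate(data):
                if missing[i][j]:
                    # Prediction plus random noise, so imputed values keep
                    # the residual variability of the observed data.
                    row[j] = a + b * row[k] + rng.gauss(0.0, sd)
    return data


def multiple_imputation(rows, m=5, seed=0):
    """Generate m completed data sets from the same incomplete data."""
    rng = random.Random(seed)
    return [chained_imputation(rows, rng=rng) for _ in range(m)]
```

Each conditional model is specified on its own, which is why no joint distribution has to be written down. The m completed data sets differ only in the imputed cells, and each would then be analysed separately before the results are pooled.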