Multivariate and univariate imputation sampling
mvis7 mainvarlist using filename[.dta] [if exp] [in range] [weight] , m(#) [ boot[(varlist)] cc(ccvarlist) cmd(cmdlist) cycles(#) draw[(varlist)] genmiss(string) id(string) noconstant on(varlist) replace seed(#) ]
uvis7 regression_cmd yvar xvarlist [if exp] [in range] [weight] , gen(newvarname) [ boot draw replace seed(#) ]
where
regression_cmd may be logistic, logit, mlogit, ologit, or regress.
All weight types supported by regression_cmd are allowed; see weights.
Description
mvis7 (multivariate imputation sampling) imputes missing values in mainvarlist by using switching regression, an iterative multivariable regression technique. Sets of imputed and non-imputed variables are stored to a new file called filename. Any number of complete imputations may be created.
uvis7 (univariate imputation sampling) imputes missing values in the single variable yvar based on multiple regression on xvarlist. uvis7 is called repeatedly by mvis7 in a regression switching mode to perform multivariate imputation.
The missing observations are assumed to be "missing at random" (MAR) or "missing completely at random" (MCAR), according to the jargon. See for example van Buuren et al (1999) for an explanation of these concepts.
Options for mvis7
m(#) is not optional. # is the number of imputations required (minimum 1, no upper limit).
boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be imputed with the boot option of uvis7 activated. If (varlist) is omitted then all members of mainvarlist with missing observations are imputed using the boot option of uvis7.
cc(ccvarlist) prevents imputation of missing data in mainvarlist for cases in which any member of ccvarlist has a missing value. "cc" signifies "complete case". Note that members of ccvarlist are used for imputation if they appear in mainvarlist, but not otherwise. Use of this option is equivalent to entering if ~missing(var1) & ~missing(var2) ..., where var1, var2, ... denote the members of ccvarlist.
cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist, when it becomes the dependent variable in the switching regression procedure used by uvis7 (see Remarks). The first item in cmdlist may be a command such as regress or may have the syntax varlist:cmd, specifying that command cmd applies to all the variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and items must be separated by commas.
The default cmd for a variable is logit when there are two distinct values, mlogit when there are 3-5 and regress otherwise.
Example: cmd(regress) specifies that all variables are to be imputed by regress, overriding the defaults.
Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by logit, x3 by regress, and all others by their default choices.
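The default rule for choosing a command can be written out as a small function. The following is an illustrative Python sketch, not the Stata code of mvis7; the function name default_cmd and the use of None to mark a missing value are assumptions made for the example.

```python
def default_cmd(values):
    """Return the default command name for a variable, following the stated rule:
    logit for 2 distinct values, mlogit for 3 to 5, regress otherwise.
    Missing values are represented by None (an assumption of this sketch)."""
    distinct = {v for v in values if v is not None}
    k = len(distinct)
    if k == 2:
        return "logit"
    if 3 <= k <= 5:
        return "mlogit"
    return "regress"


if __name__ == "__main__":
    print(default_cmd([0, 1, None, 1]))       # binary variable
    print(default_cmd([1, 2, 3, 2, None]))    # three categories
    print(default_cmd([0.3, 1.7, 2.2, 4.1, 5.0, 6.6]))  # continuous
```

An entry in cmd() for a variable takes precedence over this rule.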
cycles(#) determines the number of cycles of regression switching to be carried out. Default # is 10.
draw[(varlist)] instructs that each member of varlist be imputed with the draw option of uvis7. If (varlist) is omitted then all relevant variables are imputed with the draw option of uvis7.
genmiss(string) creates an indicator variable for the missingness of data in any variable in mainvarlist for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by if, in, etc. The indicator variable for xvar is named stringxvar.
id(string) creates a variable called string containing the original sort order of the data. Default string: _i.
noconstant suppresses the regression constant in all regressions.
on(varlist) changes the operation of mvis7 in a major way. With this option, uvis7 imputes each member of mainvarlist univariately on varlist. This provides a convenient way of producing multiple imputations when imputation for each variable in mainvarlist is to be done univariately on a set of complete predictors.
replace permits filename to be overwritten with new data. replace may not be abbreviated.
seed(#) sets the random number seed to #. To reproduce a set of imputations, the same random number seed should be used. Default #: 0, meaning no seed is set by the program.
Options for uvis7
gen(newvar) is not optional. newvar contains original (non-missing) and imputed (originally missing) values of yvar.
boot invokes a bootstrap method for creating imputed values (see Remarks).
draw draws imputations at random from the posterior distribution of the missing values of yvar, conditional on the observed values and the members of xvarlist. The default method of imputation is by prediction matching (see Remarks).
replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not be abbreviated.
noconstant suppresses the regression constant in all regressions.
seed(#) sets the random number seed to #. See Remarks for comments on how to ensure reproducible imputations by using the seed() option. Default #: 0, meaning no seed is set by the program.
Remarks
uvis7 imputes yvar from xvarlist according to the following algorithm (see van Buuren et al (1999) section 3.2 for further technical details):
1. Estimate the vector of coefficients (beta) and the residual variance by regressing the non-missing values of yvar on xvarlist. Predict the fitted values etaobs at the non-missing observations of yvar.
2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation.
3. Draw at random a value (beta_star) from the posterior distribution of beta, allowing, through sigma_star, for uncertainty in beta.
4. Use beta_star to predict the fitted values etamis at the missing observations of yvar.
5. (Prediction matching) For each missing observation of yvar with prediction etamis, find the non-missing observation of yvar whose prediction (etaobs) on observed data is closest to etamis. This closest non-missing observation is used to impute the missing value of yvar.
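The five steps can be sketched in Python for the simplest case of regress with a single covariate. This is an illustration only, not the Stata code of uvis7. The names fit_line and impute_by_matching are hypothetical, missing values are represented by None, and the particular forms of the posterior draws (a scaled chi-squared draw for sigma_star and Normal draws for beta_star) are the usual noninformative-prior forms, assumed here because this help text does not spell them out.

```python
import math
import random


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x on complete data."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    return b0, b1, math.sqrt(rss / df), df, sxx, xbar


def impute_by_matching(x, y, seed=None):
    """Impute None entries of y from x by prediction matching (sketch)."""
    rng = random.Random(seed)
    obs = [i for i, yi in enumerate(y) if yi is not None]
    mis = [i for i, yi in enumerate(y) if yi is None]
    x_obs = [x[i] for i in obs]
    y_obs = [y[i] for i in obs]

    # Step 1: estimate beta and the residual SD; predictions at observed y.
    b0, b1, sigma_hat, df, sxx, xbar = fit_line(x_obs, y_obs)
    eta_obs = [b0 + b1 * xi for xi in x_obs]

    # Step 2 (assumed form): sigma_star = sigma_hat * sqrt(df / chi2_df).
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared with df degrees of freedom
    sigma_star = sigma_hat * math.sqrt(df / chi2)

    # Step 3 (assumed form): draw beta_star around beta using sigma_star.
    # In the centred parametrisation the two estimates are uncorrelated.
    centre_star = (b0 + b1 * xbar) + sigma_star * rng.gauss(0.0, 1.0) / math.sqrt(len(x_obs))
    b1_star = b1 + sigma_star * rng.gauss(0.0, 1.0) / math.sqrt(sxx)
    b0_star = centre_star - b1_star * xbar

    # Steps 4 and 5: predict at the missing cases, then borrow the observed
    # y whose prediction eta_obs is closest to eta_mis.
    completed = list(y)
    for i in mis:
        eta_mis = b0_star + b1_star * x[i]
        donor = min(range(len(eta_obs)), key=lambda k: abs(eta_obs[k] - eta_mis))
        completed[i] = y_obs[donor]
    return completed


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    y = [1.1, 2.0, None, 4.2, 4.9, None, 7.1, 8.0]
    print(impute_by_matching(x, y, seed=1))
```

Because each imputed value is copied from an observed case, imputations produced this way always lie within the set of observed values of yvar.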
With the boot option, a variant on this algorithm is used. beta_star is estimated by regressing yvar on xvarlist after taking a bootstrap sample of the non-missing observations. This has the advantage of robustness since the distribution of beta is no longer assumed to be multivariate normal.
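A minimal Python sketch of the bootstrap step, for a single covariate, is given below. It is an illustration and not the Stata code of uvis7; the function names are hypothetical, and redrawing a sample whose covariate values are all identical is a safeguard added for the sketch.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def bootstrap_beta_star(x_obs, y_obs, rng):
    """Estimate beta_star by refitting on a bootstrap sample of the
    non-missing observations (sampling cases with replacement)."""
    n = len(x_obs)
    while True:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x_obs[i] for i in idx]
        if len(set(xs)) > 1:  # a slope needs at least two distinct x values
            break
    ys = [y_obs[i] for i in idx]
    return fit_line(xs, ys)


if __name__ == "__main__":
    rng = random.Random(2004)
    x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y_obs = [2.2, 3.9, 6.1, 8.0, 9.7, 12.3]
    print(bootstrap_beta_star(x_obs, y_obs, rng))
```

The resulting beta_star would then be used in place of the posterior draw when predicting at the missing observations.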
With the draw option, another variant on the algorithm is used. The imputed values are predicted directly from beta_star, sigma_star and the covariates. This option assumes that yvar is Normally distributed, given the covariates. The method is not robust to departures from Normality and may produce implausible imputations. It is provided mainly for pedagogic reasons, and also to deal with special situations in which the assumption of Normality is known to be reasonable.
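For a single covariate, drawing directly from the Normal model amounts to adding Normal noise with standard deviation sigma_star to the prediction from beta_star. The Python sketch below illustrates this under that assumption; the function name and argument layout are hypothetical and it is not the Stata code of uvis7.

```python
import random


def draw_imputations(x_mis, b0_star, b1_star, sigma_star, rng):
    """Impute y at the missing cases as prediction plus Normal noise.
    Values are not restricted to the observed range of y, which is how
    implausible imputations (for example negative values for a positive
    variable) can arise."""
    return [b0_star + b1_star * xi + sigma_star * rng.gauss(0.0, 1.0)
            for xi in x_mis]


if __name__ == "__main__":
    rng = random.Random(101)
    print(draw_imputations([0.5, 1.5, 2.5], 0.2, 1.1, 0.8, rng))
```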
Note that uvis7 will not impute observations for which a value of a variable in xvarlist is missing. Only complete cases within xvarlist are used.
Missing data for ordered (or unordered) categorical covariates should be imputed by using the ologit (or mlogit) command. In these cases, prediction matching is done on the scale of the mean absolute difference in the predicted class probabilities, preceded by logit transformation.
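One reading of this matching rule is sketched below in Python: each predicted class probability is logit-transformed, and the distance between two cases is the mean absolute difference of the transformed values across classes. The function names are hypothetical, the probabilities are assumed to lie strictly between 0 and 1, and the sketch is not the Stata code of uvis7.

```python
import math


def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def class_distance(p_a, p_b):
    """Mean absolute difference between logit-transformed class probabilities."""
    return sum(abs(logit(a) - logit(b)) for a, b in zip(p_a, p_b)) / len(p_a)


def closest_donor(p_mis, p_obs_list):
    """Index of the observed case whose predicted probabilities are nearest."""
    return min(range(len(p_obs_list)),
               key=lambda j: class_distance(p_mis, p_obs_list[j]))


if __name__ == "__main__":
    p_mis = [0.2, 0.5, 0.3]
    donors = [[0.6, 0.3, 0.1], [0.25, 0.45, 0.3], [0.1, 0.1, 0.8]]
    print(closest_donor(p_mis, donors))
```

The observed category of the selected donor would then be used as the imputed value.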
mvis7 carries out multivariate imputation in mainvarlist using regression switching (van Buuren et al 1999) as follows:
1. Ignore any observations for which mainvarlist has only missing values, or for which any member of ccvarlist (if specified) has a missing value.
2. For each variable in mainvarlist with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initialises the iterative procedure by ensuring that no relevant values are missing.
3. For each variable in mainvarlist in turn, impute missing values by applying uvis7 with the remaining variables as covariates.
4. Repeat step 3 cycles() times, replacing the imputed values with updated values at the end of each cycle.
A single imputation sample is created for each variable with any relevant missing values.
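The switching loop can be sketched in Python as follows. This is an illustration of the control flow only, with several simplifications that are assumptions of the sketch: every variable is imputed by least squares with prediction matching and without the posterior draw of uvis7, the initial fill samples observed values at random, imputed values are updated as soon as each variable has been processed, and ccvarlist is not modelled. The names solve, ols_fit and switch_impute are hypothetical.

```python
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def ols_fit(X, y):
    """Least-squares coefficients; each row of X starts with the constant 1."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def switch_impute(data, cycles=10, seed=None):
    """data maps variable name -> list of floats, with None for missing.
    Returns one completed copy of the data."""
    rng = random.Random(seed)
    names = list(data)
    n = len(data[names[0]])
    # Step 1: ignore observations in which every variable is missing.
    keep = [i for i in range(n) if any(data[v][i] is not None for v in names)]
    miss = {v: [i for i in keep if data[v][i] is None] for v in names}
    cur = {v: list(data[v]) for v in names}
    # Step 2: fill the gaps with randomly chosen observed values.
    for v in names:
        observed = [data[v][i] for i in keep if data[v][i] is not None]
        for i in miss[v]:
            cur[v][i] = rng.choice(observed)
    # Steps 3 and 4: impute each variable in turn from the others, cycles times.
    for _ in range(cycles):
        for v in names:
            if not miss[v]:
                continue
            others = [w for w in names if w != v]
            row = lambda i: [1.0] + [cur[w][i] for w in others]
            obs = [i for i in keep if data[v][i] is not None]
            beta = ols_fit([row(i) for i in obs], [data[v][i] for i in obs])
            eta = lambda i: sum(b * xi for b, xi in zip(beta, row(i)))
            eta_obs = [eta(i) for i in obs]
            for i in miss[v]:
                e = eta(i)
                j = min(range(len(obs)), key=lambda k: abs(eta_obs[k] - e))
                cur[v][i] = data[v][obs[j]]
    return cur


if __name__ == "__main__":
    data = {
        "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "x2": [2.1, 3.9, None, 8.2, 9.8, 12.1, None, 16.3, 17.9, 20.2],
        "x3": [5.0, 3.0, 6.0, None, 2.0, 7.0, 1.0, 8.0, None, 4.0],
    }
    print(switch_impute(data, cycles=10, seed=101))
```

Running the function m times with different seeds would correspond to creating m imputation samples.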
Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are probably sufficient. We have chosen a compromise default of 10.
"Multiple imputation" (MI) implies the creation and analysis of several imputed datasets. To do this, one would run mvis7 with m set to a suitable number, for example 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the m separate imputations. A suitable estimation tool for this purpose is micombine.
Further comments
An interesting application of MI is to investigate possible models, for example prognostic models, in which selection of influential variables is required (Clark & Altman 2003). For example, the stability of the final model across the imputation samples is of interest.
In survival analysis, it is recommended to include the log of the survival time and the censoring indicator in the variables to be used for imputation. Van Buuren et al (1999) give a detailed discussion of the different types of covariate that can be included in the imputation model and discuss the important issue of how to deal with variables which are missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
In the present implementation of multivariate imputation sampling in mvis7, all the variables in mainvarlist are used for imputation of all the others. This restriction could be lifted, but it is not clear that the additional complexity would pay off.
See also Van Buuren's website http://www.multiple-imputation.com for further information and software sources.
Examples
. uvis7 regress y x1 x2 x3, gen(ym)
. mvis7 x1 x2 x3 using imputed, m(5)
. mvis7 x1 x2 x3 using imputed, m(5) cycles(20) cc(x4 x5)
. mvis7 x1-x5 using imputed, m(10) boot draw(x1 x2 x3) cmd(x1 x2:mlogit, x3:ologit) id(pid) seed(101) genmiss(m_)
Author
Patrick Royston, MRC Clinical Trials Unit, London. patrick.royston@ctu.mrc.ac.uk
References
van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694. Also see http://www.multiple-imputation.com.
Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3):226-244.
Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. Journal of Clinical Epidemiology 56:28-37.
Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3):227-241.
Also see
On-line: help for mijoin7, micombine7, miset and related programs (if