-------------------------------------------------------------------------------
help for mvis7, uvis7                                            Patrick Royston
-------------------------------------------------------------------------------

Multivariate and univariate imputation sampling

mvis7 mainvarlist using filename[.dta] [if exp] [in range] [weight], m(#) [boot[(varlist)] cc(ccvarlist) cmd(cmdlist) cycles(#) draw[(varlist)] genmiss(string) id(string) noconstant on(varlist) replace seed(#)]

uvis7 regression_cmd yvar xvarlist [if exp] [in range] [weight], gen(newvarname) [boot draw noconstant replace seed(#)]

where regression_cmd may be logistic, logit, mlogit, ologit, or regress.

All weight types supported by regression_cmd are allowed; see weights.

Description

mvis7 (multivariate imputation sampling) imputes missing values in mainvarlist by using switching regression, an iterative multivariable regression technique. Sets of imputed and non-imputed variables are stored to a new file called filename. Any number of complete imputations may be created.

uvis7 (univariate imputation sampling) imputes missing values in the single variable yvar based on multiple regression on xvarlist. uvis7 is called repeatedly by mvis7 in a regression switching mode to perform multivariate imputation.

The missing observations are assumed to be "missing at random" (MAR) or "missing completely at random" (MCAR), according to the jargon. See for example van Buuren et al (1999) for an explanation of these concepts.

Options for mvis7

m(#) is not optional. # is the number of imputations required (minimum 1, no upper limit).

boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be imputed with the boot option of uvis7 activated. If (varlist) is omitted then all members of mainvarlist with missing observations are imputed using the boot option of uvis7.

cc(ccvarlist) prevents imputation of missing data in mainvarlist for cases in which any member of ccvarlist has a missing value. "cc" signifies "complete case". Note that members of ccvarlist are used for imputation if they appear in mainvarlist, but not otherwise. Use of this option is equivalent to entering if ~missing(var1) & ~missing(var2) ..., where var1, var2, ... denote the members of ccvarlist.
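As a sketch of the equivalence just described, the two calls below should restrict imputation to the same observations. The variable names x1-x5 and the file names imputed_cc and imputed_if are illustrative assumptions, not part of the package.

```stata
* Assumed variables: x1 x2 x3 are to be imputed; x4 x5 must be complete.

* Form 1: complete-case restriction via cc()
mvis7 x1 x2 x3 using imputed_cc, m(5) cc(x4 x5)

* Form 2: the same restriction written out as an if condition
mvis7 x1 x2 x3 using imputed_if if ~missing(x4) & ~missing(x5), m(5)
```

The imputed values in the two files will generally differ, because each run makes its own random draws; only the set of observations considered for imputation is expected to match.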

cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist, when it becomes the dependent variable in the switching regression procedure used by uvis7 (see Remarks). The first item in cmdlist may be a command such as regress or may have the syntax varlist:cmd, specifying that command cmd applies to all the variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and each item should be followed by a comma.

The default cmd for a variable is logit when there are two distinct values, mlogit when there are 3-5 and regress otherwise.

Example: cmd(regress) specifies that all variables are to be imputed by regress, over-riding the defaults.

Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by logit, x3 by regress and all others by their default choices.

cycles(#) determines the number of cycles of regression switching to be carried out. Default # is 10.

draw[(varlist)] instructs that each member of varlist be imputed with the draw option of uvis7. If (varlist) is omitted then all relevant variables are imputed with the draw option of uvis7.

genmiss(string) creates an indicator variable for the missingness of data in any variable in mainvarlist for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by if, in, etc. The indicator variable for xvar is named stringxvar.

id(string) creates a variable called string containing the original sort order of the data. Default string: _i.
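A minimal sketch of genmiss() and id() used together follows. The variable names x1-x3, the prefix m_, the id name pid and the file name imputed_gm are illustrative assumptions. Under the naming rule above, the indicator for x1 would be called m_x1, and it is created only if at least one value of x1 was imputed.

```stata
* Assumed variables x1 x2 x3; assumed prefix m_ and id variable name pid.
mvis7 x1 x2 x3 using imputed_gm, m(5) genmiss(m_) id(pid)

* Inspect the output file. That the indicator and id variables are
* saved there is an assumption of this sketch.
use imputed_gm, clear
describe
```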

noconstant suppresses the regression constant in all regressions.

on(varlist) changes the operation of mvis7 in a major way. With this option, uvis7 imputes each member of mainvarlist univariately on varlist. This provides a convenient way of producing multiple imputations when imputation for each variable in mainvarlist is to be done univariately on a set of complete predictors.
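For illustration, a call using on() might look like the sketch below. The names y1, y2, x1-x3 and imputed_on are assumptions; x1-x3 are taken to be complete predictors.

```stata
* Assumed variables: y1 y2 have missing values; x1 x2 x3 are complete.
* Each of y1 and y2 is imputed univariately on x1 x2 x3,
* rather than by regression switching among y1 and y2.
mvis7 y1 y2 using imputed_on, m(5) on(x1 x2 x3)
```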

replace permits filename to be overwritten with new data. replace may not be abbreviated.

seed(#) sets the random number seed to #. To reproduce a set of imputations, the same random number seed should be used. Default #: 0, meaning no seed is set by the program.

Options for uvis7

gen(newvar) is not optional. newvar contains original (non-missing) and imputed (originally missing) values of yvar.

boot invokes a bootstrap method for creating imputed values (see Remarks).

draw draws imputations at random from the posterior distribution of the missing values of yvar, conditional on the observed values and the members of xvarlist. The default method of imputation is by prediction matching (see Remarks).

replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not be abbreviated.

noconstant suppresses the regression constant in all regressions.

seed(#) sets the random number seed to #. See Remarks for comments on how to ensure reproducible imputations by using the seed() option. Default #: 0, meaning no seed is set by the program.
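One way to check reproducibility is to impute the same variable twice with the same seed and compare the results. This is a sketch only: the variable names y, x1-x3, ym1 and ym2 and the seed value 101 are assumptions.

```stata
* Assumed variables: y (with missing values) and complete covariates x1 x2 x3.
uvis7 regress y x1 x2 x3, gen(ym1) seed(101)
uvis7 regress y x1 x2 x3, gen(ym2) seed(101)

* compare reports how many observations differ between the two variables
compare ym1 ym2
```

If the seed fully determines the random draws, compare should report no differences between ym1 and ym2.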

Remarks

uvis7 imputes yvar from xvarlist according to the following algorithm (see van Buuren et al (1999) section 3.2 for further technical details):

1. Estimate the vector of coefficients (beta) and the residual variance by regressing the non-missing values of yvar on xvarlist. Predict the fitted values etaobs at the non-missing observations of yvar.

2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation.

3. Draw at random a value (beta_star) from the posterior distribution of beta, allowing, through sigma_star, for uncertainty in beta.

4. Use beta_star to predict the fitted values etamis at the missing observations of yvar.

5. (Prediction matching) For each missing observation of yvar with prediction etamis, find the non-missing observation of yvar whose prediction (etaobs) on observed data is closest to etamis. This closest non-missing observation is used to impute the missing value of yvar.

With the boot option, a variant on this algorithm is used. beta_star is estimated by regressing yvar on xvarlist after taking a bootstrap sample of the non-missing observations. This has the advantage of robustness since the distribution of beta is no longer assumed to be multivariate normal.

With the draw option, another variant on the algorithm is used. The imputed values are predicted directly from beta_star, sigma_star and the covariates. This option assumes that yvar is Normally distributed, given the covariates. The method is not robust to departures from Normality and may produce implausible imputations. It is provided mainly for pedagogic reasons, and also to deal with special situations in which the assumption of Normality is known to be reasonable.

Note that uvis7 will not impute observations for which a value of a variable in xvarlist is missing. Only complete cases within xvarlist are used.

Missing data for ordered (or unordered) categorical covariates should be imputed by using the ologit (or mlogit) command. In these cases, prediction matching uses the mean absolute difference in the predicted class probabilities, after logit transformation of the probabilities.
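For the regress case, steps 1 to 5 of the uvis7 algorithm can be written out as below. The help text gives no formulas, so this is a sketch assuming the standard normal linear model with a noninformative prior and no weights. The symbols n_obs (number of non-missing observations of yvar), k (number of estimated coefficients), X_obs and X_mis (covariate matrices at the non-missing and missing observations), g, z, L and j(i) are notation introduced here for illustration.

```latex
% Sketch under assumed textbook results; not taken from the uvis7 source code.
\begin{align*}
\eta^{\mathrm{obs}} &= X_{\mathrm{obs}}\hat\beta && \text{(step 1)}\\
\sigma^{*} &= \hat\sigma\,\sqrt{(n_{\mathrm{obs}}-k)/g},
  \qquad g \sim \chi^{2}_{n_{\mathrm{obs}}-k} && \text{(step 2)}\\
\beta^{*} &= \hat\beta + \sigma^{*} L z,
  \qquad L L^{\top} = (X_{\mathrm{obs}}^{\top}X_{\mathrm{obs}})^{-1},\;
  z \sim N(0, I_{k}) && \text{(step 3)}\\
\eta^{\mathrm{mis}} &= X_{\mathrm{mis}}\beta^{*} && \text{(step 4)}\\
j(i) &= \arg\min_{j}\,\bigl|\eta^{\mathrm{obs}}_{j}-\eta^{\mathrm{mis}}_{i}\bigr|,
  \qquad y^{\mathrm{imp}}_{i} = y_{j(i)} && \text{(step 5)}
\end{align*}
```

Because step 5 copies an observed value of yvar, every value imputed by prediction matching is one that actually occurs in the data.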

mvis7 carries out multivariate imputation in mainvarlist using regression switching (van Buuren et al 1999) as follows:

1. Ignore any observations for which mainvarlist has only missing values, or for which any member of ccvarlist (if specified) has a missing value.

2. For each variable in mainvarlist with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initialises the iterative procedure by ensuring that no relevant values are missing.

3. For each variable in mainvarlist in turn, impute missing values by applying uvis7 with the remaining variables as covariates.

4. Repeat step 3 cycles() times, replacing the imputed values with updated values at the end of each cycle.

A single imputation sample is created for each variable with any relevant missing values.

Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are probably sufficient. We have chosen a compromise default of 10.

"Multiple imputation" (MI) implies the creation and analysis of several imputed datasets. To do this, one would run mvis7 with m set to a suitable number, for example 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the m separate imputations. A suitable estimation tool for this purpose is micombine.
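The averaging procedure is not spelled out above. A common choice, and the one assumed in this sketch, is Rubin's rules. The notation is introduced here for illustration: Q_i is the estimate of a parameter from imputation i, U_i is its estimated variance (squared standard error), and m >= 2 is the number of imputations.

```latex
% Rubin's rules for pooling m imputation-specific estimates (assumed procedure).
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m} Q_{i}
  && \text{(pooled estimate)}\\
W &= \frac{1}{m}\sum_{i=1}^{m} U_{i}
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(Q_{i}-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)}\\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}
\end{align*}
```

The pooled standard error is the square root of T. The between-imputation term B is what carries the extra uncertainty due to the missing data, which a single imputation cannot reflect.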

Further comments

An interesting application of MI is to investigate possible models, for example prognostic models, in which selection of influential variables is required (Clark & Altman 2003). For example, the stability of the final model across the imputation samples is of interest.

In survival analysis, it is recommended to include the log of the survival time and the censoring indicator in the variables to be used for imputation. Van Buuren et al (1999) give a detailed discussion of the different types of covariate that can be included in the imputation model and discuss the important issue of how to deal with variables which are missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).

In the present implementation of multivariate imputation sampling in mvis7, all the variables in varlist are used for imputation of all the others. This restriction could be lifted, but it is not clear that the additional complexity would pay off.

See also Van Buuren's website http://www.multiple-imputation.com for further information and software sources.

Examples

. uvis7 regress y x1 x2 x3, gen(ym)

. mvis7 x1 x2 x3 using imputed, m(5)

. mvis7 x1 x2 x3 using imputed, m(5) cycles(20) cc(x4 x5)

. mvis7 x1-x5 using imputed, m(10) boot draw(x1 x2 x3) cmd(x1 x2:mlogit,x3:ologit) id(pid) seed(101) genmiss(m_)

Author

Patrick Royston, MRC Clinical Trials Unit, London. patrick.royston@ctu.mrc.ac.uk

References

van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694. Also see http://www.multiple-imputation.com.

Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3):226-244.

Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. Journal of Clinical Epidemiology 56:28-37.

Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3):227-241.

Also see

On-line: help for mijoin7, micombine7, miset and related programs (if