help for ice, uvis                                              Patrick Royston
-------------------------------------------------------------------------------

Multiple imputation by the MICE system of chained equations

Syntax

ice [mainvarlist] [if] [in] [weight] [, major_options less_used_options]

uvis cmd {yvar | llvar ulvar} xvars [if] [in] [weight] [, options]

options                    Description
-------------------------------------------------------------------------
ice major_options
  clear                    clears the original data from memory and loads
                             the imputed dataset into memory
  dryrun                   reports the prediction equations - no imputations
                             are done
  eq(eqlist)               defines customised prediction equations
  m(#)                     defines the number of imputations
  match(varlist)           prediction matching for each member of varlist
  passive(passivelist)     passive imputation
  saving(filename[, replace])
                           imputed and non-imputed variables are stored to
                             filename
  stepwise                 constructs prediction equations by stepwise
                             variable selection
  swopts(stepwise_options) options for stepwise

ice stepwise_options
  forward                  perform forward-stepwise selection
  group(group_list)        create groups of variables for joint testing for
                             addition or removal
  lock(varlist)            variables to be kept in all models
  pe(#)                    significance level for addition to a model
  pr(#)                    significance level for removal from a model
  show                     show each stepwise regression

ice less_used_options
  allmissing               imputes in observations with all values in
                             mainvarlist missing
  boot(varlist)            estimates regression coefficients for varlist in
                             a bootstrap sample
  by(varlist)              imputation within the levels implied by varlist
  cc(varlist)              prevents imputation of missing data in
                             observations in which varlist has a missing
                             value
  cmd(cmdlist)             defines regression command(s) to be used for
                             imputation
  conditional(condlist)    conditional imputation
  cycles(#)                determines number of cycles of regression
                             switching
  debug                    assistance to debug individual regressions
  dropmissing              omits from the output all observations not in
                             the estimation sample
  eqdrop(eqdroplist)       removes variables from prediction equations
  genmiss(string)          creates missingness indicator variable(s)
  id(varname)              creates varname containing the original sort
                             order of the data
  initialonly              impute by random sampling from distribution of
                             non-missing values
  interval(intlist)        imputes interval-censored variables
  matchpool(#)             size of pool of potential matches for prediction
                             mean matching
  monotone                 assumes pattern of missingness is monotone, and
                             creates relevant prediction equations
  noconstant               suppresses the regression constant
  nopp                     suppresses special treatment of perfect
                             prediction
  noshoweq                 suppresses presentation of prediction equations
  noverbose                suppresses messages showing the progress of the
                             imputations
  nowarning                suppresses warning messages
  on(varlist)              imputes each member of mainvarlist univariately
  orderasis                enters the variables in the order given
  persist                  ignore errors when trying to impute "difficult"
                             variables and/or models
  restrict([varname] [if]) fit models on a specified subsample, impute
                             missing data for entire estimation sample
  seed(#)                  sets random number seed
  substitute(sublist)      substitutes dummy variables for multilevel
                             categorical variables
  trace(trace_filename)    monitors convergence of the imputation algorithm

uvis options
  gen(newvarname)          creates variable containing imputations. Not
                             optional
  boot                     estimates regression coefficients in a bootstrap
                             sample
  by(varlist)              imputation within the levels implied by varlist
  match                    does prediction mean matching
  matchpool(#)             size of pool of potential matches for prediction
                             mean matching
  nopp                     suppresses special treatment of perfect
                             prediction
  noverbose                suppresses information about the imputation
                             process
  replace                  overwrites newvarname if it exists
  restrict([varname] [if]) fit models on a specified subsample, impute
                             missing data for entire estimation sample
  seed(#)                  sets random number seed
-------------------------------------------------------------------------

where

cmd (with uvis) may be intreg, logistic, logit, mlogit, nbreg, ologit, or regress. llvar ulvar are required with intreg.

An element of mainvarlist for ice takes one of two forms: varname or [i.|m.|o.]varname. Details are given in Special features for imputing categorical variables. If mainvarlist is omitted, variables and chained equations are input from special global macros; see the eq() and stepwise options for details.

All weight-types are supported.

Stata 11 users: Please see mi ice, which does all that ice does and a little bit more, and is conveniently integrated into the new mi system.

Description

ice imputes missing values in mainvarlist by using switching regression, an iterative multivariable regression technique. The abbreviation MICE means multiple imputation by chained equations, and was apparently coined by Stef van Buuren. ice implements MICE for Stata. Sets of imputed and non-imputed variables are stored to a new file called filename. Any number of complete imputations may be created. The original data are stored in filename as "imputation number 0" and the new variable _mj is set to 0 for these observations.

uvis (univariate imputation sampling) imputes missing values in the single variable yvar based on multiple regression on xvars. uvis is called repeatedly by ice in a regression switching mode to perform multivariate imputation.

The missing observations are assumed to be "missing at random" (MAR) or "missing completely at random" (MCAR), according to the jargon. See for example van Buuren et al (1999) for an explanation of these concepts.

Please note that ice and uvis require Stata 8.0 or higher. There have been incompatibility issues with Stata 7 or lower.
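
As a rough illustration of the workflow described above, the following sketch creates some artificial missing values in the auto dataset shipped with Stata and imputes them with ice. It assumes that ice is installed (for example via ssc install ice) and that the Stata version in use provides runiform(); the file name auto_imp, the seeds and the choice of variables are purely illustrative.

```stata
* Sketch only: assumes ice is installed and runiform() is available.
sysuse auto, clear
set seed 2718

* Create some artificial missing values to impute
replace mpg   = . if runiform() < 0.2
replace price = . if runiform() < 0.2

* Five imputations by chained equations, stored in auto_imp.dta
ice mpg price weight length, m(5) saving(auto_imp, replace) seed(101)

* _mj is 0 for the original data and indexes the imputations otherwise
use auto_imp, clear
tabulate _mj
```

Here weight and length are complete, so they serve only as predictors; mpg and price are each imputed from the other three variables by default.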

Special features for imputing categorical variables

The prefixes i., m. and o. for a variable in ice's mainvarlist are a convenience feature designed to simplify specification of the imputation model for categorical variables with three or more levels. You should hardly ever need to use Stata's xi dummy variable and interaction creator directly with ice commands, since dummy variables and more are adequately handled by using the i., m. and o. prefixes.

The prefix i. in i.varname may be used only when varname has no missing data. It applies xi to i.varname to create the corresponding dummy variables. If varname has missing data, imputation is required; either the m. or the o. prefix (see below) should be used with such variables. See Pitfalls in using the i. prefix for further information.

Use of m.varname or o.varname substitutes i. for m. or o. and applies xi: to i.varname, at the same time telling ice to impute missing values of varname using the mlogit or ologit commands, respectively. Use of the m. or o. prefixes also ensures that the corresponding dummy variables are used as predictors in imputation models for other variables (see substitute()) and are 'passively' imputed (see passive()). Suppose that x is a multilevel categorical variable. Then

ice o.x varlist, options

is expanded to

xi: ice x i.x varlist, substitute(x:i.x) cmd(x:ologit) options

Similarly,

ice m.x varlist, options

is expanded to

xi: ice x i.x varlist, substitute(x:i.x) cmd(x:mlogit) options

The resulting 'expanded' version of the ice command is stored in the $F9 global macro. It can be retrieved if desired by pressing the F9 key.

Note that the i., m. and o. prefixes are also valid with binary variables, although much less likely to be useful since one would not wish to impute a binary variable using either mlogit or ologit.
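
The prefix expansion can be inspected without running any imputations. The sketch below is an illustration only: it assumes ice is installed and uses the auto dataset, in which rep78 is a five-level variable with some missing values. Treating rep78 as ordered is an assumption made for the example.

```stata
* Sketch only: rep78 is treated as an ordered categorical variable.
sysuse auto, clear

* dryrun reports the prediction equations; nothing is imputed or saved
ice o.rep78 mpg price, dryrun

* The expanded command is described above as being stored in $F9
macro list F9
```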

Options

        +---------------------+
    ----+ ice (major options) +----------------------------------------------

clear clears the original data from memory and loads the imputed dataset. Unless the saving() option is also specified, the data in memory are not permanently saved; this must then be done manually using the save or saveold commands.

dryrun causes ice to report the prediction equations it has constructed from the various inputs, but no imputations are done and no files are created. The option name ("dryrun") may be abbreviated as dry. It is not mandatory to specify an output file with saving(filename) for a dry run. Sometimes the prediction equation set-up needs to be carefully checked before running what may be a lengthy imputation process. Note that stepwise selection of prediction equations (stepwise option) still works when dryrun has been specified.

eq(eqlist) allows one to define prediction equations for any subset of variables in mainvarlist. The eq() option, particularly when used with passive(), allows great flexibility in the possible imputation schemes. Note that eq() takes precedence over all default definitions and assumptions about the way a given variable in mainvarlist is to be imputed. If the passive() and substitute() options are not invoked, the default set of equations is that each variable in mainvarlist with any missing data is imputed from all other variables in mainvarlist.

When eq() is specified, the syntax of eqlist is varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a member (or subset) of mainvarlist. Variable names prefixed by i. are allowed, provided that the names were prefixed by i., m. or o. in mainvarlist. They are translated to the corresponding dummy variables created by xi:.

A 'blank' (null, constant-only) equation is specified as _cons, for example, eq(x4 x5:_cons). Such equations are reported in the table of prediction equations as "[Empty equation]". The prediction model for variables with empty equations is simply _cons.

If mainvarlist is omitted, ice takes mainvarlist from the global macro $ice_main and the equations, regression commands and predicted variables from global macros $ice_eq#, $ice_cmd# and $ice_x#, respectively, for # = 1, ..., $ice_neq. The number of equations is stored in $ice_neq. These macros are created automatically when ice's stepwise option is used (see details under stepwise). They may also be user-defined. The macros may be inspected in Stata by using the command macro list ice_*.
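
A customised set of equations following the eqlist syntax above might look like the sketch below. The data setup (artificial missingness in the auto dataset) and the particular choice of predictors are assumptions made for illustration; dryrun is used so that only the resulting table of prediction equations is shown.

```stata
* Sketch only: assumes ice is installed and runiform() is available.
sysuse auto, clear
set seed 2718
replace mpg   = . if runiform() < 0.2
replace price = . if runiform() < 0.2

* Impute mpg from weight only, and price from mpg and length;
* weight and length are complete and so have no equations of their own
ice mpg price weight length, eq(mpg: weight, price: mpg length) dryrun
```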

m(#) defines # as the number of imputations required (minimum 1, no upper limit). The default # is 1.

match[(varlist)] instructs that each member of varlist be imputed with the match option of uvis. This provides prediction matching for each member of varlist. If (varlist) is omitted then all relevant variables are imputed with the match option of uvis. The default, if match() is not specified, is to draw from the posterior predictive distribution of each variable requiring imputation.

passive(passivelist) allows the use of "passive" imputation of variables that depend on other variables, some of which are imputed. The syntax of passivelist is varname:exp [\varname:exp ...]. Notice the requirement to use "\" as a separator between items in passivelist, rather than the usual comma; the reason is that a comma may be a valid part of an expression. The option is most easily explained by example. Suppose x1 is a categorical variable with 3 levels, and that two dummy variables x1a, x1b have been created by the commands

. generate byte x1a=(x1==2)
. generate byte x1b=(x1==3)

Now suppose that x1 is to be imputed by the mlogit command, and is to be treated as the two dummy variables x1a and x1b when predicting other variables. Use of mlogit is achieved by the option cmd(x1:mlogit). When x1 is imputed, we want x1a and x1b to be updated with new values which depend on the imputed values of x1. This may be achieved by specifying passive(x1a:x1==2 \ x1b:x1==3). It is necessary also to remove x1 from the list of predictors when variables other than x1 are being imputed, and this is done by using the substitute() option; in the present example, you would specify substitute(x1:x1a x1b).

Note that although in this example x1a will take the (possibly unintended) value of 0 when x1 is missing, ice is careful to ensure that x1a (and x1b) inherit the missingness of x1, and are passively imputed following active imputation of missing values of x1. If this were not done, incorrect results could occur. The responsibility of the user is to create x1a and x1b before running ice such that their missing values are identical to those of x1.

A second example is multiplicative interactions between variables, for example, between x1 and x2 (e.g. x12=x1*x2); this could be entered as passive(x12:x1*x2). It would cause the interaction term x12 to be omitted when either x1 or x2 was being imputed, since it would make no sense to impute x1 from its interaction with x2. substitute() is not needed here.

It should be stressed that variables to be imputed passively must already exist and must be included in mainvarlist, otherwise they are not recognised. Passive variables may be defined in terms of variables in mainvarlist and variables not in mainvarlist, although it would of course make no sense not to involve at least one variable in mainvarlist.
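
The x1, x1a, x1b example above can be put together as a single run along the following lines. This is a sketch on simulated data: the data-generating steps, the variable x3, the file name passive_imp and the seeds are all assumptions made for illustration, and rnormal() and runiform() require a reasonably recent Stata. The dummies are created with an if qualifier so that their missing values coincide with those of x1, which is one way to meet the user's responsibility noted above.

```stata
* Sketch only: simulated data, assumes ice is installed.
clear
set obs 200
set seed 31415
generate x2 = rnormal()
generate x3 = x2 + rnormal()
generate byte x1 = 1 + (x2 + rnormal() > -0.5) + (x2 + rnormal() > 0.5)
replace x1 = . if runiform() < 0.15
replace x3 = . if runiform() < 0.15

* Dummy variables that share the missingness of x1
generate byte x1a = (x1 == 2) if !missing(x1)
generate byte x1b = (x1 == 3) if !missing(x1)

* x1 is imputed by mlogit; x1a and x1b are updated passively and
* replace x1 as predictors in the equations for other variables
ice x1 x1a x1b x2 x3, cmd(x1:mlogit)      ///
    passive(x1a:x1==2 \ x1b:x1==3)        ///
    substitute(x1:x1a x1b)                ///
    m(3) saving(passive_imp, replace) seed(7)
```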

saving(filename[, replace]) saves the imputation to filename. replace allows filename to be overwritten with new data. replace may not be abbreviated.

stepwise constructs prediction equations by stepwise variable selection among members of mainvarlist. There are 3 steps to the process. First, ice creates a dataset with 1 imputation using a randomly drawn subset of values from the distribution of each variable with missing values. (This is the standard initialisation step for ice, and is invoked automatically by the initialonly option.) Next, ice runs stepwise to select variables for each prediction equation. Binary dummy variables are treated appropriately. By default, forward selection at a 5% significance level is used; see the swopts() option for other possibilities. Finally, ice retrieves the reduced equations and performs imputation with them as usual.

Using stepwise also causes ice to store mainvarlist, the selected equations, variables and commands in global macros called $ice_*, as described under the eq() option.

swopts(stepwise_options) allows the following stepwise_options for use with stepwise: forward, group(group_list), lock(varlist), pe(#), pr(#) and show. Note that only pe(#), pr(#) and forward are standard options of Stata's stepwise command; the remainder are used to group variables for joint testing for inclusion or exclusion from the models, to construct a list of variables formatted for use with stepwise's lockterm1 option, and to show the output from stepwise. Further details of individual options are given below under ice (stepwise options).

Specifying neither pe(#) nor pr(#) is equivalent to specifying pe(0.05), i.e. the default method is forward selection of variables significant at the 5% level.

Note that variables in mainvarlist that have the prefix i., indicating that they are categorical, are to be represented by their dummy variables and have no missing data, should retain their i. prefix when they are included in the group() or lock() options.

        +------------------------+
    ----+ ice (stepwise options) +-------------------------------------------

forward specifies the forward-stepwise method and may be specified only when both pr() and pe() are also specified. Specifying both pr() and pe() without forward results in backward-stepwise selection. Specifying only pr() results in backward selection, and specifying only pe() results in forward selection.

group(group_list) specifies variables always to be tested jointly for inclusion or exclusion from models. An element of group_list is a varlist, and elements are separated by commas, for example group(x1 i.x2, y1 y2). Such groups of variables (or, in the case of categorical variables prefixed with i., their implied dummy variables) are surrounded by parentheses when presented to stepwise for analysis.

lock(varlist) specifies variables to be kept in all models. Such variables are surrounded by parentheses when presented to stepwise for analysis. The lockterm1 option of stepwise is applied to them.

pe(#) specifies the significance level for addition to the model; terms with p < pe() are eligible for addition.

pr(#) specifies the significance level for removal from the model; terms with p >= pr() are eligible for removal.

show displays the output from stepwise for each regression analysis to develop the prediction equations used by ice.

        +-------------------------+
    ----+ ice (less used options) +------------------------------------------

allmissing imputes missing values in observations in which all variables in mainvarlist are missing. The default is to leave such values as missing.

boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be imputed with the boot option of uvis activated. If (varlist) is omitted then all members of mainvarlist with missing observations are imputed using the boot option of uvis.

by(varlist) performs multiple imputation separately for all combinations of variables in varlist. Observations with missing values for any members of varlist are excluded. May be combined with restrict().

cc(varlist) prevents imputation of missing data in mainvarlist for cases in which any member of varlist has a missing value. "cc" signifies "complete case". Note that members of varlist are used for imputation if they appear in mainvarlist, but not otherwise. Use of this option is equivalent to entering if ~missing(var1) & ~missing(var2) ..., where var1, var2, ... denote the members of varlist.

cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist, when it becomes the dependent variable in the switching regression procedure used by uvis (see Algorithm used by uvis). The first item in cmdlist may be a command such as regress or may have the syntax varlist:cmd, specifying that command cmd applies to all the variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and each item should be followed by a comma.

The default cmd for a variable is logit when there are two distinct values, mlogit when there are 3-5 and regress otherwise.

Example: cmd(regress) specifies that all variables are to be imputed by regress, over-riding the defaults.

Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by logit, x3 by regress and all others by their default choices.

Advanced use: If a cmd is implicitly defined for a variable by an o. or m. prefix and the cmd() option is used explicitly for that same variable then the explicit use takes precedence over the implicit use. For example, the combination ... o.x1, cmd(x1:regress) would impute x1 with regress rather than with the implicit ologit. Used with match(x1), this would give a reasonable alternative to ordinal logistic regression for imputing an ordered categorical variable x1.

conditional(condlist) invokes conditional imputation. Each item of condlist has the form varlist:condition. Items are separated by backslash (\). The idea is that members of varlist are only informative when condition is true, and that they take some pre-determined value when condition is false.

Important: This option was not correctly implemented in versions of ice_ before 1.2.2; use which ice_ to check your version.

Conditional imputation requires that (i) when any variable included in condition is missing, all variables in varlist are missing, and (ii) when condition is false, each variable in varlist takes only one value (the pre-determined value, which might be 0 or a unique "not-applicable" code such as 99).

In detail, members of varlist are imputed in the usual way for the subset of observations for which if condition is true (i.e. condition evaluates to a non-zero quantity). For the subset of observations for which if condition is false, the pre-determined value is identified from the data for each member of varlist and is used to impute any missing values for that variable. An example is given below.

condition is a Stata expression constructed so that if condition can be evaluated for the current dataset. Variables appearing in condition may be members of mainvarlist or merely variables in the dataset. The only other situation in ice in which variables that do not appear in mainvarlist may be used is described under the passive() option.

Consider a simple example, a dataset comprising three incomplete variables age, female, and pregnant, where female is 1 for females, 0 for males, and pregnant is 1 for pregnant, 0 for not pregnant. Since males can't be pregnant, we wish to impute missing values of pregnant using only data from females. If we impute someone with missing gender as male, we want their pregnancy status always to be imputed as non-pregnant. If males are simply coded as non-pregnant then the pre-determined value is the value of pregnant denoting non-pregnant, i.e. 0; if instead males are coded as pregnant=99 then the pre-determined value is 99. In either case, we implement the conditional imputation as follows:

. ice age pregnant female, conditional(pregnant: female==1) clear

Here, the prediction equation for age is pregnant female, that for female is age and that for pregnant is age if female==1. Observations of pregnant for originally missing observations of female now imputed as male (i.e. female = 0) are assigned the value 0 by ice.

We can have dependent conditional imputation. For example, suppose a fertility test fertile, taking the value 1 for fertile and 0 for infertile, was available just for females. We might code this as follows:

. ice age pregnant female fertile, conditional(pregnant: female==1 & fertile==1 \ fertile: female==1) clear

which reflects that only fertile females can become pregnant, and only females have a fertility test.

cycles(#) determines the number of cycles of regression switching to be carried out. Default # is 10.

debug provides assistance for debugging individual regressions. As ice runs, it prints out, for each imputation and cycle, the name of the regression command, the variable being imputed and R2, the explained variation of the model (Nagelkerke method). At the same time, the values from the last cycle only are stored in a new file called _ice_debug.dta, in the current working directory. A plot of R2 against cycle number may indicate abnormalities; for example if R2 shows instability, the corresponding model may have some features that need improving. The option is useful also for detecting regression models that explain a negligible amount of variation; such models are candidates for deletion.

Because only the final cycle is stored, for debugging purposes it may be most sensible to use the debug option with, say, cycles(100) and m(1).

dropmissing is a feature designed to save memory when using the file of imputed data created by ice. It omits from filename all observations which are not in the estimation sample, that is for which either (i) they are filtered out by if or in, or a non-positive weight, or (ii) the values of all variables in mainvarlist are missing. This option provides a "clean" analysis file of imputations, with no missing values. Note that the observations not in the estimation sample are omitted also from the original data, stored as imputation #0 in filename.

eqdrop(eqdroplist) deletes variables from prediction equations. The syntax of eqdroplist is varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a member (or subset) of mainvarlist. One can only remove predictors from equations for variables with missing values (although trying to remove predictors from non-existent equations is not a fatal error - an information message is issued). Variable names prefixed by i. are allowed, provided that the names were prefixed by i., m. or o. in mainvarlist. They are translated to the corresponding dummy variables created by xi:.

genmiss(string) creates an indicator variable for the missingness of data in any variable in mainvarlist for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by if, in, etc. The indicator variable for xvar is named stringxvar, that is, string followed by the variable name. The information on missingness is implicit in the original data, which is stored as "imputation 0".

id(newvarname) creates a variable called newvarname containing the original sort order of the data. Default newvarname: _mi.

interval(intlist) imputes interval-censored variables. An interval-censored value is one which is known to lie in an interval [a,b] where a and b are finite and a <= b, or in (-infinity,b] or in [a,infinity). When either terminal is infinite we have left or right censoring, respectively. intlist has the syntax varname:llvar ulvar [,varname:llvar ulvar ...], where each varname is an interval-censored variable, each llvar contains the lower bound (a) for varname and each ulvar contains the upper bound (b) for varname (or a missing value to represent plus or minus infinity). The supplied values of varname are irrelevant since they will be replaced anyway; it is only required that varname exist. Observations with llvar missing and ulvar present are left-censored for varname. Observations with llvar present and ulvar missing are right-censored for varname. Observations with llvar = ulvar are complete, and no imputation is done for them. Observations with both llvar and ulvar missing are imputed assuming an uncensored normal distribution. See Interval censoring for further information.

initialonly imputes by random sampling from the distribution of the non-missing values of each variable which has missing value(s). This is the initialisation step of the MICE algorithm (see Remarks). This option may be used to get a 'quick and dirty' set of multiple imputations with which to explore initial impressions of the analysis model, or to investigate possible prediction equations for subsequent multiple imputation using the MICE method. The prediction equations that are displayed are the ones that would be used by default in a full MICE imputation run; with the initialonly option, they are ignored when imputations are produced.

matchpool(#) modifies the implementation of the match() option. match performs predictive mean matching in which a pool of potential matches is constructed and one member of this pool is sampled (with equal probabilities). # specifies the size of this pool. The default is 3. Please note that older versions of ice used # = 1.

monotone assumes that the members of mainvarlist have a monotone missingness pattern, and ice defines the prediction equations accordingly. For variables x1, ..., xk the imputation equations would be x1 on [nothing], x2 on x1, x3 on x1 x2, ... , xk on x1 x2 ... x(k-1). When the missingness really is monotonic, only one cycle of MICE is required, so the default here is cycles(1). There is no advantage in specifying more than one cycle.

With the monotone option, ice reports a 'non-monotonicity score'. This is defined as 100 * (sum of numerators) / (sum of denominators), where the sums are taken over all adjacent pairs of variables in mainvarlist. Consider two variables, x1 and x2. The numerator for x1 and x2, i.e. the non-monotonicity, is the number of observations in the estimation sample for which x1 is missing and x2 is observed. If the numerator is positive, x1 and x2 show a non-monotonic pattern. The denominator for x1 and x2 is the number of observations in the estimation sample for which x2 is observed.

ice takes a relaxed view of runs in which the non-monotonicity score is positive. It warns the user but goes ahead with the imputation anyway - it assumes that the user knows what they are doing.
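
The definition of the non-monotonicity score can be worked through by hand on a toy dataset. The sketch below does not call ice; it simply applies the formula above, under the assumptions that the variables are taken in the order x1 x2 x3 and that all five observations belong to the estimation sample. The data values are invented for illustration.

```stata
* Sketch only: hand computation of the score, not ice's own code.
clear
input x1 x2 x3
1 1 1
2 2 .
3 . .
. 4 .
5 5 5
end

local num = 0
local den = 0
local prev x1
foreach v in x2 x3 {
    * numerator: earlier variable missing, later variable observed
    quietly count if missing(`prev') & !missing(`v')
    local num = `num' + r(N)
    * denominator: later variable observed
    quietly count if !missing(`v')
    local den = `den' + r(N)
    local prev `v'
}
display "non-monotonicity score = " 100 * `num' / `den'
```

In this toy dataset only the fourth observation breaks the monotone pattern (x1 missing, x2 observed), and the denominators are 4 and 2, so the score is 100 * 1 / 6, or about 16.7.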

noshoweq suppresses the presentation of the prediction equations.

noconstant suppresses the regression constant in all regressions.

nopp suppresses treatment of the perfect prediction bug (see Avoiding the perfect prediction bug).

noverbose suppresses display of the imputation number (as #) and cycle number within imputations (as .) which show the progress of the imputations.

nowarning suppresses warning messages.

on(varlist) changes the operation of ice in a major way. With this option, uvis imputes each member of mainvarlist univariately on varlist. This provides a convenient way of producing multiple imputations when imputation for each variable in mainvarlist is to be done univariately on a set of complete predictors.

orderasis enters the variables in mainvarlist into the MICE algorithm in the order given. The default is to order them according to the number of missing values: the variable with least missingness gets imputed first, and so on.

persist causes ice to ignore errors raised by uvis when trying to impute a "difficult" variable, or impute with a model that is difficult to fit to the data to hand. Trying to impute a "difficult" variable using the ologit or mlogit command is the most common cause of failure. By default, ice stops with an error message. With persist, ice continues to the next variable to be imputed, not updating the variable that raised an error. Often, by the play of chance, the "difficult" variable is successfully updated in a subsequent cycle, and no damage is done to the imputation process.

If the error for a given variable appears in every cycle, you should consider changing the prediction equation for that variable, since its imputed values are unlikely to be appropriate.

We do not recommend the routine use of persist. Only use it when it appears that there is sporadic failure to fit an imputation model.

restrict([varname] [if]) specifies that imputation models be computed using the subsample identified by varname and if.

The subsample is defined by the observations for which varname!=0 that also meet the if conditions. Typically, varname=1 defines the subsample and varname=0 indicates observations not belonging to the subsample. For observations whose subsample status is uncertain, varname should be set to a missing value; such observations are dropped from the subsample.

By default ice fits imputation models and imputes missing values using the sample of observations identified in the [if] [in] options. The restrict() option identifies a subset of this sample to be used for model estimation. Imputation is restricted to the sample identified in the [if] [in] options. Thus, predictions and their associated imputations are made 'out-of-sample' with respect to the subsample defined by restrict().

Be careful to avoid restrictions that prevent prediction for all the relevant observations. For example, models that involve mlogit will fail to predict 'everywhere' if the restrict() option excludes any of the levels of the target variable, as in the following example. school is a four-level categorical variable coded 0, 1, 2, 3:

. gen byte ok = (school > 0) if !missing(school)
. ice school house age sex bcg, clear restrict(ok)

By default, school is imputed using mlogit. Predictions cannot be made for observations with school==0. ice will halt with error #303 (equation not found).

seed(#) sets the random number seed to #. In order to reproduce a set of imputations, the same random number seed should be used. See Reproducibility of results from uvis and ice for further comments. Default #: 0, meaning no seed is set by the program; depending on the status of Stata's random number seed, different sets of imputations should be obtained on each run.

substitute(sublist) is typically used with the passive() option to represent multilevel categorical variables as dummy variables in models for predicting other variables. See passive() for more details. The syntax of sublist is varname:dummyvarlist [,varname:dummyvarlist ...] where varname is the name of a variable to be substituted and dummyvarlist is the list of dummy variables representing it.

Note, however, the following important convenience feature: substitute() may be used without corresponding expressions in passive() to recreate dummy variables automatically. If the values of variables in dummyvarlist are NOT defined through expressions involving varname in the passive() option, then the variables in dummyvarlist are calculated according to the actual range of values of varname. For example, suppose the options passive(x1a:x1==2 \ x1b:x1==3) and substitute(x1:x1a x1b) were specified. Provided that all the non-missing values of x1 were 2 when x1a==1 and all the non-missing values of x1 were 3 when x1b==1, then passive(x1a:x1==2 \ x1b:x1==3) is implied by substitute(x1:x1a x1b) and can be omitted. The rule applied by substitute(x:dummy1 [dummy2 ...]) for defining dummy variables dummy1, dummy2, ... is as follows:

1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0.

2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise.

2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise.

3. Repeat steps 1 and 2a,b for dummy2, dummy3, ... as necessary.

With many such categorical variables this feature can save a lot of typing.

trace(trace_filename) monitors the convergence of the imputation algorithm. For each original variable with missing values, the mean of the imputed values is stored as a variable in trace_filename, together with the cycle number at which that mean was calculated. The results are stored only for the final imputation. For diagnostic purposes, it is sensible to run trace() with m(1) and a large number of cycles, such as cycles(100). When the run is complete, it is helpful to load trace_filename into memory and plot the mean for each imputed variable against the cycle number. If necessary, smoothing may be applied to clarify any apparent pattern. Convergence is judged to have occurred when the pattern of the imputed means is random. It is usually obvious from the appearance of the plot how many cycles are needed for convergence.
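
A diagnostic run of the kind suggested above might be set up as in the following sketch. The data setup, the file names ice_trace and auto_trace_imp and the seeds are assumptions made for illustration. The names of the variables stored in the trace file are not stated here, so the sketch uses describe to list them rather than guessing; the plot of each mean against the cycle number can then be drawn with the names that describe reports.

```stata
* Sketch only: assumes ice is installed and that ice_trace.dta does
* not already exist in the working directory.
sysuse auto, clear
set seed 2718
replace mpg   = . if runiform() < 0.2
replace price = . if runiform() < 0.2

* One imputation with many cycles, as recommended for diagnostics
ice mpg price weight length, m(1) cycles(100) trace(ice_trace) ///
    saving(auto_trace_imp, replace) seed(101)

* Load the trace file and list the stored variables
use ice_trace, clear
describe
```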

    +------+
----+ uvis +-------------------------------------------------------------

boot invokes a bootstrap method for creating imputed values (see bootstrap).

by(varlist) performs imputation separately for each combination of values of the variables in varlist. Observations with missing values for any members of varlist are excluded. May be combined with restrict().

gen(newvar) is not optional. newvar contains original (non-missing) and imputed (originally missing) values of yvar.

match creates imputations by prediction matching. The default is to draw imputations at random from the posterior distribution of the missing values of yvar, conditional on the observed values and the members of xvars. See match for further details.

matchpool(#) - see matchpool for details.

noconstant suppresses the regression constant in all regressions.

noverbose suppresses non-error messages while uvis is running.

replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not be abbreviated.

restrict([varname] [if]) specifies that the imputation model be computed using the subsample identified by varname and if.

The subsample is defined by the observations for which varname!=0 that also meet the if conditions. Typically, varname=1 defines the subsample and varname=0 indicates observations not belonging to the subsample. For observations whose subsample status is uncertain, varname should be set to a missing value; such observations are dropped from the subsample.

By default uvis fits the imputation model using the sample of observations identified in the [if] [in] options. The restrict() option identifies a subset of this sample.

seed(#) sets the random number seed to #. See Reproducibility of results from uvis and ice for comments on how to ensure reproducible imputations by using the seed() option. The default # is 0, meaning no seed is set by the program.

Remarks

When cmd is regress, uvis imputes yvar from xvars according to the following algorithm (see van Buuren et al (1999) section 3.2 for further technical details):

1. Estimate the vector of coefficients (beta) and the residual variance by regressing the non-missing values of yvar on the current "completed" version of xvars. Predict the fitted values etaobs at the non-missing observations of yvar.

2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation.

3. Draw at random a value (beta_star) from the posterior distribution of beta, conditional on sigma_star, thus allowing for uncertainty in beta.

4. Use beta_star to predict the fitted values etamis at the missing observations of yvar.

5. The imputed values are predicted directly from beta_star, sigma_star and the covariates. For imputation by linear regression, this step assumes that yvar is Normally distributed, given the covariates. For other types of imputation, samples are drawn from the appropriate distribution.

With the match option, step 5 is replaced by the following. For each missing observation of yvar with prediction etamis, find the non-missing observation of yvar whose prediction (etaobs) on observed data is closest to etamis. This closest non-missing observation is used to impute the missing value of yvar.

The default draw method is not robust to departures from Normality and may produce implausible imputations. For example, if the original distribution is skew and positive-valued, the imputed distribution will not necessarily have the appropriate amount of skewness, nor will all the imputed values necessarily be positive. Log transformation of positive variables may greatly improve the appropriateness of the imputations.
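Steps 1 to 5, and the match variant of step 5, can be sketched in Python for the simplest case of one complete covariate. This is an illustration under stated assumptions and not a description of the uvis code: the function name impute_regress, the restriction to a single covariate, and the use of the standard non-informative-prior posterior (residual variance drawn as RSS divided by a chi-squared variate) are all choices made for this example.

```python
import math
import random


def impute_regress(y, x, match=False, rng=random):
    """Impute missing y (None) from one complete covariate x."""
    obs = [i for i, yi in enumerate(y) if yi is not None]
    mis = [i for i, yi in enumerate(y) if yi is None]
    n = len(obs)
    # Step 1: least squares on the observed cases. x is centred, so the
    # intercept and slope estimates are uncorrelated.
    xbar = sum(x[i] for i in obs) / n
    ybar = sum(y[i] for i in obs) / n
    sxx = sum((x[i] - xbar) ** 2 for i in obs)
    b = sum((x[i] - xbar) * (y[i] - ybar) for i in obs) / sxx
    eta_obs = {i: ybar + b * (x[i] - xbar) for i in obs}
    rss = sum((y[i] - eta_obs[i]) ** 2 for i in obs)
    df = n - 2
    # Step 2: draw sigma_star (assumed posterior: sigma^2 = RSS / chi2(df)).
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    sigma_star = math.sqrt(rss / chi2)
    # Step 3: draw beta_star given sigma_star.
    a_star = ybar + sigma_star / math.sqrt(n) * rng.gauss(0, 1)
    b_star = b + sigma_star / math.sqrt(sxx) * rng.gauss(0, 1)
    # Step 4: predictions at the missing observations.
    eta_mis = {i: a_star + b_star * (x[i] - xbar) for i in mis}
    out = list(y)
    for i in mis:
        if match:
            # Prediction matching: borrow y from the observed case whose
            # prediction etaobs is closest to etamis.
            donor = min(obs, key=lambda j: abs(eta_obs[j] - eta_mis[i]))
            out[i] = y[donor]
        else:
            # Step 5: add Normal residual noise to the prediction.
            out[i] = eta_mis[i] + sigma_star * rng.gauss(0, 1)
    return out


rng = random.Random(101)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, None, 4.1, 5.2, None, 6.8, 8.1]
drawn = impute_regress(y, x, rng=rng)
matched = impute_regress(y, x, match=True, rng=rng)
```

With match=True every imputed value is one of the observed values of y, which is why matching cannot produce values outside the observed range.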

The alternative match method is recommended only for continuous variables when the Normality assumption is clearly untenable, even approximately. It is not necessary, nor is it implemented, for binary, ordered categorical or nominal variables. match may work well when the distribution of a continuous variable is very non-Normal, but it may sometimes result in biased imputations.

With the boot option, steps 2-4 are replaced by a bootstrap estimation of beta_star and sigma_star, obtained by regressing yvar on xvars after taking a bootstrap sample of the non-missing observations. This has the advantage of robustness since the distribution of beta is no longer assumed to be multivariate normal.

Note that uvis will not impute observations for which a value of a variable in xvars is missing. However, all original (missing or non-missing) observations of yvar will be copied into newvar in such cases. This is a change from the first release of uvis (with mvis). Previously, newvar would be set to missing whenever a value of a variable in xvars was missing, irrespective of the value of yvar.

Missing data for ordered (or unordered) categorical covariates should be imputed by using the ologit (or mlogit) command. match is neither required nor implemented in these cases.
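The bootstrap alternative described for the boot option can be sketched in the same spirit. The Python below is a minimal illustration for one covariate, not the uvis implementation; the function name boot_star and the retry when a resample has no variation in x are assumptions of this example.

```python
import math
import random


def boot_star(y_obs, x_obs, rng=random):
    """Bootstrap estimates (a_star, b_star, sigma_star) from observed cases."""
    n = len(y_obs)
    while True:
        # Resample the observed cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x_obs[i] for i in idx]
        ys = [y_obs[i] for i in idx]
        xbar = sum(xs) / n
        sxx = sum((v - xbar) ** 2 for v in xs)
        if sxx > 0:  # the slope needs at least two distinct x values
            break
    ybar = sum(ys) / n
    # Refit the regression on the bootstrap sample.
    b_star = sum((xv - xbar) * (yv - ybar) for xv, yv in zip(xs, ys)) / sxx
    a_star = ybar - b_star * xbar
    rss = sum((yv - a_star - b_star * xv) ** 2 for xv, yv in zip(xs, ys))
    sigma_star = math.sqrt(rss / (n - 2))
    return a_star, b_star, sigma_star


x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_obs = [2.0 + 3.0 * v for v in x_obs]   # exactly linear, for checking
a_star, b_star, sigma_star = boot_star(y_obs, x_obs, rng=random.Random(7))
```

The estimates vary from one resample to the next, and that variation stands in for the posterior draws of steps 2-4. No distributional form is imposed on beta.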

ice carries out multivariate imputation in mainvarlist using regression switching (van Buuren et al 1999) as follows:

1. Ignore any observations for which mainvarlist has only missing values, or if the cc(varlist) option has been specified, for which any member of varlist has a missing value.

2. For each variable in mainvarlist with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initialises the iterative procedure by ensuring that no relevant values are missing.

3. For each variable in mainvarlist in turn, impute missing values by applying uvis with the remaining variables as covariates.

4. Repeat step 3 cycles() times, replacing the imputed values with updated values at the end of each cycle.

A single imputation sample is created for each variable with any relevant missing values.
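The structure of the regression-switching loop can be sketched in Python. This is an outline only: chained_imputation and impute_one are hypothetical names, impute_one stands in for a call to uvis, and step 2 is simplified to sampling observed values at random with replacement.

```python
import random


def chained_imputation(data, impute_one, cycles=10, rng=random):
    """data maps variable name -> list of values, with None for missing.

    impute_one(y, covariates, rng) must return y with its None entries
    filled; it is a hypothetical stand-in for uvis.
    """
    n = len(next(iter(data.values())))
    # Step 1: ignore observations in which every variable is missing.
    keep = [i for i in range(n)
            if any(col[i] is not None for col in data.values())]
    original = {name: [col[i] for i in keep] for name, col in data.items()}
    # Step 2: initialise each gap with a randomly chosen observed value.
    current = {}
    for name, col in original.items():
        observed = [v for v in col if v is not None]
        current[name] = [v if v is not None else rng.choice(observed)
                         for v in col]
    incomplete = [name for name, col in original.items() if None in col]
    # Steps 3 and 4: cycle through the incomplete variables, each time
    # re-imputing the originally missing values from the other variables.
    for _ in range(cycles):
        for name in incomplete:
            covariates = {k: v for k, v in current.items() if k != name}
            current[name] = impute_one(original[name], covariates, rng)
    return current


def mean_imputer(y, covariates, rng):
    """Trivial imputer used only to exercise the loop."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]


data = {
    "x1": [1.0, None, 3.0, None],
    "x2": [2.0, 4.0, None, None],
    "x3": [5.0, 6.0, 7.0, None],
}
result = chained_imputation(data, mean_imputer, cycles=3,
                            rng=random.Random(1))
```

The fourth observation is missing on every variable and is therefore dropped in step 1. A realistic impute_one would use the covariates, for example through a regression draw.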

Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are probably sufficient. We have chosen a compromise default of 10.

"Multiple imputation" (MI) implies the creation and analysis of several imputed datasets. To do this, one would run ice with m set to a suitable number, for example 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the m separate imputations. A suitable estimation tool for this purpose is mim.
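For a single parameter, the usual post-MI averaging procedure is the set of combination rules generally known as Rubin's rules. The Python sketch below shows the arithmetic only; the function name pool_rubin and the example numbers are made up for illustration, and nothing here describes how mim is implemented.

```python
import math


def pool_rubin(estimates, variances):
    """Combine m estimates and their squared standard errors."""
    m = len(estimates)
    qbar = sum(estimates) / m                       # pooled estimate
    within = sum(variances) / m                     # mean within-imputation variance
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return qbar, math.sqrt(total)


# Hypothetical results from m = 5 imputed datasets.
est, se = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9],
                     [0.04, 0.05, 0.04, 0.05, 0.04])
```

The between-imputation term is what carries the extra uncertainty due to the missing data; analysing a single imputed dataset would omit it and understate the standard error.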

Handling the outcome variable

To avoid bias, the outcome variable must always be included in the list of variables to be used for imputation. In survival analysis, in particular, it is essential to include the censoring indicator as well as the survival time. van Buuren et al (1999) recommend a log transformation of the survival time, apparently a heuristic choice. We have shown (White & Royston 2009) that for a single binary predictor and a proportional hazards analysis model, the correct imputation model comprises the baseline cumulative hazard, the censoring indicator and the binary predictor. The theory remains approximately valid for a normally distributed predictor with a weak effect. More complex cases have not yet been investigated, but at least some guidance is now available.

Handling binary variables

Binary variables present no difficulty. By default, in the MICE procedure, when such a variable is the response, it is predicted from other variables by using logistic regression; when it is a covariate, it is modelled in the only way possible, effectively as a single dummy variable.

Ensure that binary variables are coded 0/1. Although, in theory, one could use ologit or mlogit to model them, in practice there is no advantage in doing so. Furthermore, do not use the i. prefix with binary variables, since there is a speed penalty in doing so.

Handling categorical variables

Categorical variables with 3 or more levels may in principle be treated in different ways. By default, in ice, variables with 3-5 levels are modelled using multinomial logistic regression (mlogit command) when the response, and as a single linear term when a covariate. The same behaviour occurs with the ordered logistic model (ologit command). Our recommended strategy is to use the m. or o. prefixes for variables to be imputed using unordered or ordered logistic regression. This approach removes the need to define the substitute() and passive() options, both of which can be tedious and error-prone to type.

You should be aware that unless the dataset is large, use of the mlogit command may produce unstable estimates if the number of levels is too large, and may compromise the accuracy of the imputations. It is hard to predict when this will occur.

Values of a variable y that are interval censored are imputed under the assumption that y is normally distributed with unknown mean and variance. The method, which is fast and efficient, is essentially as described for right-censored variables in section 3.3 of Royston (2001). A minor extension to allow left or interval censoring is employed. For example, if A < y < B and A and B are both finite, the imputed value for y will follow a truncated normal distribution with bounds A and B, variance parameter estimated from the data and mean given by the linear predictor for the imputation model for y. Stata's intreg command is used to estimate the mean and variance of y. When A and B are both missing (infinite), imputation of y simply assumes the normal distribution just mentioned, but without bounds.

If you wish to impose range limits on the imputed values, the lower and upper bound variables may be set accordingly. For example, to impute right-censored (e.g. survival) data, you would set llvar equal to all the observed times to event, whether censored or not, and ulvar to all the uncensored event times and missing for the censored times. This would cause the right-censored values to be imputed without restriction. If you wanted to bound the imputed values above, say by 10, you would specify ulvar to be 10 (rather than missing) for all the censored observations.
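Drawing from a normal distribution truncated to [A, B] can be sketched with the inverse-CDF method. The Python below is an illustration, not the code used by ice: the function name draw_truncated_normal is hypothetical, None stands for a missing (infinite) bound, and the mean and standard deviation are taken as given rather than estimated by intreg.

```python
import random
from statistics import NormalDist


def draw_truncated_normal(mean, sd, lower=None, upper=None, rng=random):
    """Draw from N(mean, sd^2) restricted to [lower, upper]."""
    dist = NormalDist(mean, sd)
    # Probabilities at the bounds; a missing bound means no restriction.
    p_lo = 0.0 if lower is None else dist.cdf(lower)
    p_hi = 1.0 if upper is None else dist.cdf(upper)
    u = rng.uniform(p_lo, p_hi)
    # inv_cdf requires 0 < u < 1, so keep u strictly inside that range.
    u = min(max(u, 1e-12), 1 - 1e-12)
    value = dist.inv_cdf(u)
    # Guard against rounding at the edges of the interval.
    if lower is not None:
        value = max(value, lower)
    if upper is not None:
        value = min(value, upper)
    return value


rng = random.Random(2007)
# Interval censored: 2 < y < 3.
interval_draws = [draw_truncated_normal(0.0, 1.0, 2.0, 3.0, rng)
                  for _ in range(200)]
# Right censored at 5: lower bound only, upper bound missing.
right_draws = [draw_truncated_normal(4.0, 2.0, 5.0, None, rng)
               for _ in range(200)]
```

Setting the upper bound to 10 instead of None in the second call would correspond to bounding the imputed values above by 10, as described in the text.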

Avoiding the perfect prediction bug

Perfect prediction may arise in logistic, ologit or mlogit regression models when a (usually categorical) predictor variable perfectly predicts success or failure in the outcome variable. In ice, perfect prediction may occur without the user's knowledge because a large number of regression models are run silently. Perfect prediction may lead to entirely inappropriate imputations. To avoid this, uvis checks for perfect prediction; if it is detected, uvis temporarily augments the data with a small number of extra observations with low weight, in such a way as to remove the perfect prediction. A message is displayed noting the variable that has the perfect prediction issue, and that the problem has been dealt with. Such treatment of the perfect prediction bug may be switched off, if desired, by using the nopp option.
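The condition itself is easy to illustrate for a binary outcome and one categorical predictor. The Python sketch below only shows what perfect prediction looks like in the data; it is a simplified check written for this example (the function name is hypothetical) and does not reproduce the test or the data augmentation that uvis performs.

```python
from collections import defaultdict


def perfectly_predicting_levels(outcome, predictor):
    """Return {level: outcome value} for levels where the outcome is constant."""
    seen = defaultdict(set)
    for y, level in zip(outcome, predictor):
        if y is not None and level is not None:   # skip missing values
            seen[level].add(y)
    return {level: next(iter(values))
            for level, values in seen.items() if len(values) == 1}


y = [0, 1, 0, 1, 1, 1, 0]
g = ["a", "a", "b", "c", "c", "c", "b"]
flagged = perfectly_predicting_levels(y, g)
```

Levels "b" and "c" are flagged because the outcome never varies within them, so a logistic model would push the corresponding coefficients towards plus or minus infinity.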

Errors and diagnostics

ice may occasionally detect an anomaly when running uvis with a particular variable as response and a particular regression command. ice will then stop and report the uvis command it was running and the error number returned. Also, ice saves to a file called _ice_dump.dta in the working directory a snapshot of the data it was using when the error occurred. Sometimes the problem lies in a regression of a binary or categorical variable where the estimation procedure fails to converge; this is usually caused by sparse cell occupancy of the response variable. If you obtain this error you should either omit the offending variable from the imputation, or seek to combine a sparse category with another category.

Another possibility is that, again due to a defect in a particular regression command in the chained equations structure, the number of values imputed for a particular variable is less than expected. This is a serious error and again may arise from estimation problems involving a binary or categorical variable. In this situation, ice again saves to a file called _ice_dump.dta in the working directory a snapshot of the data it was using in the attempted estimation, while reporting the uvis command it was executing. You can then investigate what may have gone wrong with the command by loading the data in _ice_dump.dta and re-running the offending regression command.

Reproducibility of results from uvis and ice

Use of the option seed(#) ensures that a set of imputed values is reproduced identically for a given value of #. This is true for both uvis and ice.

Please report to the author any instances where use of ice or uvis with a fixed seed does not produce the same set of imputed values.

Pitfalls in using the i. prefix

ice commands that include i.varname in mainvarlist need to be handled with care. If varname has no missing data in the estimation sample, expected results are obtained. If varname does have missing values in the estimation sample, an error message is given and ice stops. The "estimation sample" here is the set of observations for which at least one variable in mainvarlist has non-missing value(s).

The presence of i. invokes xi, which expands i.varname in the usual way to create _Ivarname_# dummy variables. Since varname has no missing data, the dummy variables are included in the prediction equations for other variables in mainvarlist, as required.

If i.varname were allowed to have missing data in the estimation sample, xi expansion would occur as before, but each of the _Ivarname_# dummy variables would become a response variable in a prediction equation and would be predicted individually (using logistic regression). Worse, the prediction equation for each dummy variable would include the other dummy variables from i.varname. That is clearly nonsense.

The advice, as always, is (a) to use dryrun before 'production' runs if the ice command is at all complex, and then (b) to check carefully that ice's table of prediction equations is both sensible and what you expected.

Further notes

ice saves all the variables in the current data to the output, whether or not they are involved in the imputation procedure. This can make the resulting dataset very large. It may therefore be sensible to drop variables not subsequently needed for modelling before running ice.

ice determines the order of imputing variables in the cycle of chained equations according to the amount of missing data. Variables with the least missingness are imputed first. Variables with the same amount of missingness are processed in an arbitrary order, but always in the same order. Note that if ice is run twice using identical variables (at least two of which have the same amount of missingness) and the same random number seed, but with the variables with equal missingness in a different order, slightly different imputations will be generated. The differences will be purely random and will not produce bias in subsequent parameter estimates. If the boot() option is applied to all variables, the order of variables no longer affects the results.

An important application of MI is to investigate possible models, for example prognostic models, in which selection of influential variables is required (Clark & Altman 2003). For example, the stability of the final model across the imputation samples is of interest. This area of enquiry is in its infancy.

See also Van Buuren's website http://www.multiple-imputation.com for further information and software sources.

Examples

. uvis regress y x1 x2 x3, gen(ym)

. uvis logit y x1 x2 x3, gen(y) by(x4) restrict(x5) replace noverbose

. uvis intreg ll ul x1 x2 x3, gen(y)

. ice x1 x2 x3, saving(imputed) m(5)

. ice x1 x2 x3, dropmissing monotone clear m(5)

. ice x1 x2 i.x3, clear m(5)
[Note that x3 must have no missing values in the estimation sample]

. ice x1 x2 x3, saving(imputed) m(5) cycles(20) cc(x4 x5)

. ice m.x1 m.x2 o.x3 x4 x5, saving(imputed) m(10) boot(x1 x2 x3) match(x4 x5) id(pid) seed(101) genmiss(M_)

. gen x23 = x2 * x3
. ice o.x1 x2 x3 x23 z1 z2, saving(imputed) m(5) passive(x23:x2*x3) conditional(z1: if z2==0)

. ice y1 y2 y3 x1 x2 x3 x4, saving(imputed) m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3)

. ice y1 y2 y3 x1 x2 o.x3 i.x4, saving(imputed) m(5) stepwise swopts(pe(.10) pr(.15) group(x1 x2, y1 i.x4) lock(y2 x3)) match(x3)

. ice x1-x99, clear debug m(1) cycles(100)

. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) eqdrop(x2:x3, x1:x2)

. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) match(x2) dropmissing

. ice x1 ll2 ul2 x2 ll3 ul3 x3, saving(imputed) m(5) interval(x2:ll2 ul2, x3:ll3 ul3)

Author

Patrick Royston, MRC Clinical Trials Unit, London. pr@ctu.mrc.ac.uk

Further reading

van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681-694. Also see http://www.multiple-imputation.com.

Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3): 226-244.

Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. Journal of Clinical Epidemiology 56: 28-37.

Royston P. 2001. The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. Statistica Neerlandica 55: 89-104.

Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.

Royston P. 2005a. Multiple imputation of missing values: update. Stata Journal 5: 188-201.

Royston P. 2005b. Multiple imputation of missing values: update of ice. Stata Journal 5: 527-536.

Royston P. 2007. Multiple imputation of missing values: further update of ice, with an emphasis on interval censoring. Stata Journal 7: 445-464.

White I. R. and P. Royston. 2009. Imputing missing covariate values for the Cox model. Statistics in Medicine 28: 1982-1998.

White I. R., R. Daniel and P. Royston. 2010. Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables. Computational Statistics and Data Analysis 54: 2267-2275.

Acknowledgements

Ian White has made substantial contributions to the understanding and practical use of multiple imputation, and to the programming of ice and uvis. Ian wrote the guts of the draw() option; the idea and code for coping with perfect prediction are essentially all his. I am extremely grateful to him for his ongoing commitment to this project.

I am grateful also to Gillian Raab for pointing out certain issues with the prediction matching approach, particularly that it is only useful with continuous variables. As a result, the default imputation method has been changed from matching to drawing from the predictive distribution. Gillian also suggested imputing the variables in reverse order of the amount of missingness, and selecting the imputed value at random from the set determined by the available matching predictions. Both suggestions have been implemented.

Also see

On-line: help for mim (if installed), mi ice (if installed, Stata 11