help for ice, uvis                                          Patrick Royston
-------------------------------------------------------------------------------

Multiple imputation by the MICE system of chained equations

Syntax

ice [mainvarlist] [if] [in] [weight] [, major_options less_used_options]

uvis cmd {yvar|llvar ulvar} xvars [if] [in] [weight] [, options]

    options                       Description
    -------------------------------------------------------------------------
    ice major_options
      clear                       clears the original data from memory and
                                    loads the imputed dataset into memory
      dryrun                      reports the prediction equations - no
                                    imputations are done
      eq(eqlist)                  defines customised prediction equations
      m(#)                        defines the number of imputations
      match(varlist)              prediction matching for each member of
                                    varlist
      passive(passivelist)        passive imputation
      saving(filename [,replace]) imputed and non-imputed variables are
                                    stored to filename
      stepwise                    constructs prediction equations by
                                    stepwise variable selection
      swopts(stepwise_options)    options for stepwise

    ice stepwise_options
      forward                     perform forward-stepwise selection
      group(group_list)           create groups of variables for joint
                                    testing for addition or removal
      lock(varlist)               variables to be kept in all models
      pe(#)                       significance level for addition to a model
      pr(#)                       significance level for removal from a model
      show                        show each stepwise regression

    ice less_used_options
      allmissing                  imputes in observations with all values in
                                    mainvarlist missing
      boot(varlist)               estimates regression coefficients for
                                    varlist in a bootstrap sample
      by(varlist)                 imputation within the levels implied by
                                    varlist
      cc(varlist)                 prevents imputation of missing data in
                                    observations in which varlist has a
                                    missing value
      cmd(cmdlist)                defines regression command(s) to be used
                                    for imputation
      conditional(condlist)       conditional imputation
      cycles(#)                   determines number of cycles of regression
                                    switching
      debug                       assistance to debug individual regressions
      dropmissing                 omits from the output all observations not
                                    in the estimation sample
      eqdrop(eqdroplist)          removes variables from prediction equations
      genmiss(string)             creates missingness indicator variable(s)
      id(varname)                 creates varname containing the original
                                    sort order of the data
      initialonly                 impute by random sampling from distribution
                                    of non-missing values
      interval(intlist)           imputes interval-censored variables
      matchpool(#)                size of pool of potential matches for
                                    prediction mean matching
      monotone                    assumes pattern of missingness is monotone,
                                    and creates relevant prediction equations
      noconstant                  suppresses the regression constant
      nopp                        suppresses special treatment of perfect
                                    prediction
      noshoweq                    suppresses presentation of prediction
                                    equations
      noverbose                   suppresses messages showing the progress of
                                    the imputations
      nowarning                   suppresses warning messages
      on(varlist)                 imputes each member of mainvarlist
                                    univariately
      orderasis                   enters the variables in the order given
      persist                     ignore errors when trying to impute
                                    "difficult" variables and/or models
      restrict([varname] [if])    fit models on a specified subsample, impute
                                    missing data for entire estimation sample
      seed(#)                     sets random number seed
      substitute(sublist)         substitutes dummy variables for multilevel
                                    categorical variables
      trace(trace_filename)       monitors convergence of the imputation
                                    algorithm

    uvis options
      gen(newvarname)             creates variable containing imputations.
                                    Not optional
      boot                        estimates regression coefficients in a
                                    bootstrap sample
      by(varlist)                 imputation within the levels implied by
                                    varlist
      match                       does prediction mean matching
      matchpool(#)                size of pool of potential matches for
                                    prediction mean matching
      nopp                        suppresses special treatment of perfect
                                    prediction
      noverbose                   suppresses information about the imputation
                                    process
      replace                     overwrites newvarname if it exists
      restrict([varname] [if])    fit models on a specified subsample, impute
                                    missing data for entire estimation sample
      seed(#)                     sets random number seed
    -------------------------------------------------------------------------

where cmd (with uvis) may be intreg, logistic, logit, mlogit, nbreg, ologit, or regress. llvar ulvar are required with intreg.

An element of mainvarlist for ice takes one of two forms: varname or [i.|m.|o.]varname. Details are given in Special features for imputing categorical variables. If mainvarlist is omitted, variables and chained equations are input from special global macros; see the eq() and stepwise options for details.

All weight-types are supported.

Stata 11 users: Please see mi ice, which does all that ice does and a little bit more, and is conveniently integrated into the new mi system.

Description

ice imputes missing values in mainvarlist by using switching regression, an iterative multivariable regression technique. The abbreviation MICE means multiple imputation by chained equations, and was apparently coined by Stef van Buuren. ice implements MICE for Stata. Sets of imputed and non-imputed variables are stored to a new file called filename. Any number of complete imputations may be created. The original data are stored in filename as "imputation number 0" and the new variable _mj is set to 0 for these observations.
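
As a minimal sketch of the resulting file layout, the commands below create five imputations and then tabulate _mj. The variable names x1, x2, x3 and the filename imp are placeholders. That _mj takes the values 1 to 5 for the imputed copies is an assumption, inferred from the original data being numbered as imputation 0.

    . ice x1 x2 x3, saving(imp, replace) m(5)
    . use imp, clear
    . tabulate _mj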

uvis (univariate imputation sampling) imputes missing values in the single variable yvar based on multiple regression on xvars. uvis is called repeatedly by ice in a regression switching mode to perform multivariate imputation.

The missing observations are assumed to be "missing at random" (MAR) or "missing completely at random" (MCAR), according to the jargon. See for example van Buuren et al (1999) for an explanation of these concepts.

Please note that ice and uvis require Stata 8.0 or higher. There have been incompatibility issues with Stata 7 or lower.

Special features for imputing categorical variables

The prefixes i., m. and o. for a variable in ice's mainvarlist are a convenience feature designed to simplify specification of the imputation model for categorical variables with three or more levels. You should hardly ever need to use Stata's xi dummy variable and interaction creator directly with ice commands, since dummy variables and more are adequately handled by using the i., m. and o. prefixes.

The prefix i. in i.varname may be used only when varname has no missing data. It applies xi to i.varname to create the corresponding dummy variables. If varname has missing data, imputation is required; either the m. or the o. prefix (see below) should be used with such variables. See Pitfalls in using the i. prefix for further information.

Use of m.varname or o.varname substitutes i. for m. or o. and applies xi: to i.varname, at the same time telling ice to impute missing values of varname using the mlogit or ologit commands, respectively. Use of the m. or o. prefixes also ensures that the corresponding dummy variables are used as predictors in imputation models for other variables (see substitute()) and are 'passively' imputed (see passive()). Suppose that x is a multilevel categorical variable. Then

    ice o.x varlist, options

is expanded to

    xi: ice x i.x varlist, substitute(x:i.x) cmd(x:ologit) options

Similarly,

    ice m.x varlist, options

is expanded to

    xi: ice x i.x varlist, substitute(x:i.x) cmd(x:mlogit) options

The resulting 'expanded' version of the ice command is stored in the $F9 global macro. It can be retrieved if desired by pressing the F9 key.
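
For illustration, suppose x1 is an ordered categorical variable with missing values (x1, x2, x3 and imp are hypothetical names). The sketch below runs ice with the o. prefix and then lists the expanded command held in the global macro F9, as an alternative to pressing the F9 key.

    . ice o.x1 x2 x3, saving(imp, replace) m(5)
    . macro list F9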

Note that the i., m. and o. prefixes are also valid with binary variables, although much less likely to be useful since one would not wish to impute a binary variable using either mlogit or ologit.

Options

        +---------------------+
    ----+ ice (major options) +----------------------------------------------

clear clears the original data from memory and loads the imputed dataset. Unless the saving() option is also specified, the data in memory are not permanently saved; this must then be done manually using the save or saveold commands.

dryrun causes ice to report the prediction equations it has constructed from the various inputs, but no imputations are done and no files are created. The option name ("dryrun") may be abbreviated as dry. It is not mandatory to specify an output file with saving(filename) for a dry run. Sometimes the prediction equation set-up needs to be carefully checked before running what may be a lengthy imputation process. Note that stepwise selection of prediction equations (stepwise option) still works when dryrun has been specified.

eq(eqlist) allows one to define prediction equations for any subset of variables in mainvarlist. The eq() option, particularly when used with passive(), allows great flexibility in the possible imputation schemes. Note that eq() takes precedence over all default definitions and assumptions about the way a given variable in mainvarlist is to be imputed. If the passive() and substitute() options are not invoked, the default set of equations is that each variable in mainvarlist with any missing data is imputed from all other variables in mainvarlist.

When eq() is specified, the syntax of eqlist is varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a member (or subset) of mainvarlist. Variable names prefixed by i. are allowed, provided that the names were prefixed by i., m. or o. in mainvarlist. They are translated to the corresponding dummy variables created by xi:.

A 'blank' (null, constant-only) equation is specified as _cons, for example, eq(x4 x5:_cons). Such equations are reported in the table of prediction equations as "[Empty equation]". The prediction model for variables with empty equations is simply _cons.
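
A sketch of customised equations, checked with dryrun before any imputation is run. The variable names are placeholders. Here x1 is predicted from y and x2 only, x3 is given a constant-only equation, and every other variable with missing data keeps its default equation.

    . ice y x1 x2 x3 x4, eq(x1:y x2, x3:_cons) dryrun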

If mainvarlist is omitted, ice takes mainvarlist from the global macro $ice_main and the equations, regression commands and predicted variables from global macros $ice_eq#, $ice_cmd# and $ice_x#, respectively, for # = 1, ..., $ice_neq. The number of equations is stored in $ice_neq. These macros are created automatically when ice's stepwise option is used (see details under stepwise). They may also be user-defined. The macros may be inspected in Stata by using the command macro list ice_*.

m(#) defines # as the number of imputations required (minimum 1, no upper limit). The default # is 1.

match[(varlist)] instructs that each member of varlist be imputed with the match option of uvis. This provides prediction matching for each member of varlist. If (varlist) is omitted then all relevant variables are imputed with the match option of uvis. The default, if match() is not specified, is to draw from the posterior predictive distribution of each variable requiring imputation.

passive(passivelist) allows the use of "passive" imputation of variables that depend on other variables, some of which are imputed. The syntax of passivelist is varname:exp [\varname:exp ...]. Notice the requirement to use "\" as a separator between items in passivelist, rather than the usual comma; the reason is that a comma may be a valid part of an expression. The option is most easily explained by example. Suppose x1 is a categorical variable with 3 levels, and that two dummy variables x1a, x1b have been created by the commands

    . generate byte x1a=(x1==2)
    . generate byte x1b=(x1==3)

Now suppose that x1 is to be imputed by the mlogit command, and is to be treated as the two dummy variables x1a and x1b when predicting other variables. Use of mlogit is achieved by the option cmd(x1:mlogit). When x1 is imputed, we want x1a and x1b to be updated with new values which depend on the imputed values of x1. This may be achieved by specifying passive(x1a:x1==2 \ x1b:x1==3). It is necessary also to remove x1 from the list of predictors when variables other than x1 are being imputed, and this is done by using the substitute() option; in the present example, you would specify substitute(x1:x1a x1b).

Note that although in this example x1a will take the (possibly unintended) value of 0 when x1 is missing, ice is careful to ensure that x1a (and x1b) inherit the missingness of x1, and are passively imputed following active imputation of missing values of x1. If this were not done, incorrect results could occur. The responsibility of the user is to create x1a and x1b before running ice such that their missing values are identical to those of x1.
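
Putting the pieces of this example together, one possible full sequence is sketched below. The if !missing(x1) qualifier is an addition to the generate commands used earlier in this example: it makes x1a and x1b missing wherever x1 is missing, as required. The variables x2 and x3 and the filename imp are placeholders.

    . generate byte x1a = (x1==2) if !missing(x1)
    . generate byte x1b = (x1==3) if !missing(x1)
    . ice x1 x1a x1b x2 x3, cmd(x1:mlogit) passive(x1a:x1==2 \ x1b:x1==3) substitute(x1:x1a x1b) saving(imp, replace) m(5)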

A second example is multiplicative interactions between variables, for example, between x1 and x2 (e.g. x12=x1*x2); this could be entered as passive(x12:x1*x2). It would cause the interaction term x12 to be omitted when either x1 or x2 was being imputed, since it would make no sense to impute x1 from its interaction with x2. substitute() is not needed here.

It should be stressed that variables to be imputed passively must already exist and must be included in mainvarlist, otherwise they are not recognised. Passive variables may be defined in terms of variables in mainvarlist and variables not in mainvarlist, although it would of course make no sense not to involve at least one variable in mainvarlist.
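
A sketch of the interaction example, assuming a dataset with variables y, x1 and x2 (placeholder names, as is the filename imp). The interaction x12 is created first, so that it already exists, and is then listed in mainvarlist.

    . generate x12 = x1*x2
    . ice y x1 x2 x12, passive(x12:x1*x2) saving(imp, replace) m(5)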

saving(filename [,replace]) saves the imputation to filename. replace allows filename to be overwritten with new data. replace may not be abbreviated.

stepwise constructs prediction equations by stepwise variable selection among members of mainvarlist. There are 3 steps to the process. First, ice creates a dataset with 1 imputation using a randomly drawn subset of values from the distribution of each variable with missing values. (This is the standard initialisation step for ice, and is invoked automatically by the initialonly option.) Next, ice runs stepwise to select variables for each prediction equation. Binary dummy variables are treated appropriately. By default, forward selection at a 5% significance level is used; see the swopts() option for other possibilities. Finally, ice retrieves the reduced equations and performs imputation with them as usual.

Using stepwise also causes ice to store mainvarlist, the selected equations, variables and commands in global macros called $ice_*, as described under the eq() option.
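
A sketch of one way to inspect the selected equations before imputing (variable names and the filename imp are placeholders). The first command runs the selection only, the second lists the global macros, and the third omits mainvarlist so that ice reads the stored equations back from those macros. It is an assumption here that the macros are also stored when dryrun is specified.

    . ice y x1 x2 x3 x4, stepwise dryrun
    . macro list ice_*
    . ice, saving(imp, replace) m(5)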

swopts(stepwise_options) allows the following stepwise_options for use with stepwise: forward, group(group_list), lock(varlist), pe(#), pr(#) and show. Note that only pe(#), pr(#) and forward are standard options of Stata's stepwise command; the remainder are used to group variables for joint testing for inclusion or exclusion from the models, to construct a list of variables formatted for use with stepwise's lockterm1 option, and to show the output from stepwise. Further details of individual options are given below under ice (stepwise options).

Specifying neither pe(#) nor pr(#) is equivalent to specifying pe(0.05), i.e. the default method is forward selection of variables significant at the 5% level.

Note that variables in mainvarlist that have the prefix i., indicating that they are categorical, are to be represented by their dummy variables and have no missing data, should retain their i. prefix when they are included in the group() or lock() options.

        +------------------------+
    ----+ ice (stepwise options) +-------------------------------------------

forward specifies the forward-stepwise method and may be specified only when both pr() and pe() are also specified. Specifying both pr() and pe() without forward results in backward-stepwise selection. Specifying only pr() results in backward selection, and specifying only pe() results in forward selection.

group(group_list) specifies variables always to be tested jointly for inclusion or exclusion from models. An element of group_list is a varlist, and elements are separated by commas, for example group(x1 i.x2, y1 y2). Such groups of variables (or, in the case of categorical variables prefixed with i., their implied dummy variables) are surrounded by parentheses when presented to stepwise for analysis.

lock(varlist) specifies variables to be kept in all models. Such variables are surrounded by parentheses when presented to stepwise for analysis. The lockterm1 option of stepwise is applied to them.

pe(#) specifies the significance level for addition to the model; terms with p < pe() are eligible for addition.

pr(#) specifies the significance level for removal from the model; terms with p >= pr() are eligible for removal.

show displays the output from stepwise for each regression analysis to develop the prediction equations used by ice.
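
A sketch of these options in use (variable names are placeholders, and the space-separated layout of the options inside swopts() is an assumption). Only pr() is given, so selection is backward; age is locked into every equation, and each stepwise regression is displayed.

    . ice y age x1 x2 x3, stepwise swopts(pr(0.10) lock(age) show) dryrun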

        +-------------------------+
    ----+ ice (less used options) +------------------------------------------

allmissing imputes missing values in observations in which all variables in mainvarlist are missing. The default is to leave such values as missing.

boot[(varlist)] instructs that each member of varlist, a subset of mainvarlist, be imputed with the boot option of uvis activated. If (varlist) is omitted then all members of mainvarlist with missing observations are imputed using the boot option of uvis.

by(varlist) performs multiple imputation separately for all combinations of variables in varlist. Observations with missing values for any members of varlist are excluded. May be combined with restrict().

cc(varlist) prevents imputation of missing data in mainvarlist for cases in which any member of varlist has a missing value. "cc" signifies "complete case". Note that members of varlist are used for imputation if they appear in mainvarlist, but not otherwise. Use of this option is equivalent to entering if ~missing(var1) & ~missing(var2) ..., where var1, var2, ... denote the members of varlist.

cmd(cmdlist) defines the regression commands to be used for each variable in mainvarlist, when it becomes the dependent variable in the switching regression procedure used by uvis (see Algorithm used by uvis). The first item in cmdlist may be a command such as regress or may have the syntax varlist:cmd, specifying that command cmd applies to all the variables in varlist. Subsequent items in cmdlist must follow the latter syntax, and items are separated by commas.

The default cmd for a variable is logit when there are two distinct values, mlogit when there are 3-5, and regress otherwise.

Example: cmd(regress) specifies that all variables are to be imputed by regress, over-riding the defaults

Example: cmd(x1 x2:logit, x3:regress) specifies that x1 and x2 are to be imputed by logit, x3 by regress and all others by their default choices

Advanced use: If a cmd is implicitly defined for a variable by a o. or m. prefix and the cmd() option is used explicitly for that same variable then the explicit use takes precedence over the implicit use. For example, the combination ... o.x1, cmd(x1:regress) would impute x1 with regress rather than with the implicit ologit. Used with match(x1), this would give a reasonable alternative to ordinal logistic regression for imputing an ordered categorical variable x1.
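
Written out in full, the combination just described might look as follows. The names y, x2, x3 and imp are placeholders; x1 is the ordered categorical variable.

    . ice y o.x1 x2 x3, cmd(x1:regress) match(x1) saving(imp, replace) m(5)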

conditional(condlist) invokes conditional imputation. Each item of condlist has the form varlist: condition. Items are separated by backslash (\). The idea is that members of varlist are only informative when condition is true, and that they take some pre-determined value when condition is false.

Important: This option was not correctly implemented in versions of ice_ before 1.2.2; use which ice_ to check your version.

Conditional imputation requires that (i) when any variable included in condition is missing, all variables in varlist are missing, and (ii) when condition is false, each variable in varlist takes only one value (the pre-determined value, which might be 0 or a unique "not-applicable" code such as 99).

In detail, members of varlist are imputed in the usual way for the subset of observations for which if condition is true (i.e. condition evaluates to a non-zero quantity). For the subset of observations for which if condition is false, the pre-determined value is identified from the data for each member of varlist and is used to impute any missing values for that variable. An example is given below.

condition is a Stata expression constructed so that if condition can be evaluated for the current dataset. Variables appearing in condition may be members of mainvarlist or merely variables in the dataset. The only other situation in ice in which variables that do not appear in mainvarlist may be used is described under the passive() option.

Consider a simple example, a dataset comprising three incomplete variables age, female, and pregnant, where female is 1 for females, 0 for males, and pregnant is 1 for pregnant, 0 for not pregnant. Since males can't be pregnant, we wish to impute missing values of pregnant using only data from females. If we impute someone with missing gender as male, we want their pregnancy status always to be imputed as non-pregnant. If males are simply coded as non-pregnant then the pre-determined value is the value of pregnant denoting non-pregnant, i.e. 0; if instead males are coded as pregnant=99 then the pre-determined value is 99. In either case, we implement the conditional imputation as follows:

. ice age pregnant female, conditional(pregnant: female==1) clear

Here, the prediction equation for age is pregnant female, that for female is age and that for pregnant is age if female==1. Observations of pregnant for originally missing observations of female now imputed as male (i.e. female = 0) are assigned the value 0 by ice.

We can have dependent conditional imputation. For example, suppose a fertility test fertile, taking the value 1 for fertile and 0 for infertile, was available just for females. We might code this as follows:

. ice age pregnant female fertile, conditional(pregnant: female==1 & fertile==1 \ fertile: female==1) clear

which reflects that only fertile females can become pregnant, and only females have a fertility test.
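
Before running conditional imputation, it may help to check requirements (i) and (ii) on the data. The sketch below does so for the pregnancy example using standard Stata commands. assert stops with an error if any observation violates the stated condition, and the tabulation should show a single value.

    . * (i) pregnant must be missing whenever female is missing
    . assert missing(pregnant) if missing(female)
    . * (ii) pregnant should take only one value among males
    . tabulate pregnant if female==0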

cycles(#) determines the number of cycles of regression switching to be carried out. Default # is 10.

debug provides assistance for debugging individual regressions. As ice runs, it prints out, for each imputation and cycle, the name of the regression command, the variable being imputed and R2, the explained variation of the model (Nagelkerke method). At the same time, the values from the last cycle only are stored in a new file called _ice_debug.dta, in the current working directory. A plot of R2 against cycle number may indicate abnormalities; for example if R2 shows instability, the corresponding model may have some features that need improving. The option is useful also for detecting regression models that explain a negligible amount of variation; such models are candidates for deletion.

Because only the final cycle is stored, for debugging purposes it may be most sensible to use the debug option with, say, cycles(100) and m(1).

dropmissing is a feature designed to save memory when using the file of imputed data created by ice. It omits from filename all observations that are not in the estimation sample, that is, observations that are either (i) excluded by if, in, or a non-positive weight, or (ii) missing on every variable in mainvarlist. This option provides a "clean" analysis file of imputations, with no missing values. Note that the observations not in the estimation sample are omitted also from the original data, stored as imputation #0 in filename.

eqdrop(eqdroplist) deletes variables from prediction equations. The syntax of eqdroplist is varname1:varlist1 [,varname2:varlist2 ...] where each varname# (or varlist#) is a member (or subset) of mainvarlist. One can only remove predictors from equations for variables with missing values (although trying to remove predictors from non-existent equations is not a fatal error - an information message is issued). Variable names prefixed by i. are allowed, provided that the names were prefixed by i., m. or o. in mainvarlist. They are translated to the corresponding dummy variables created by xi:.

genmiss(string) creates an indicator variable for the missingness of data in any variable in mainvarlist for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by if, in, etc. The indicator variable for xvar is named stringxvar. The information on missingness is implicit in the original data, which is stored as "imputation 0".

id(newvarname) creates a variable called newvarname containing the original sort order of the data. Default newvarname: _mi.

interval(intlist) imputes interval-censored variables. An interval-censored value is one which is known to lie in an interval [a,b] where a and b are finite and a <= b, or in (-infinity,b] or in [a,infinity). When either terminal is infinite we have left or right censoring, respectively. intlist has the syntax varname:llvar ulvar [, varname:llvar ulvar ...], where each varname is an interval-censored variable, each llvar contains the lower bound (a) for varname and each ulvar contains the upper bound (b) for varname (or a missing value to represent plus or minus infinity). The supplied values of varname are irrelevant since they will be replaced anyway; it is only required that varname exist. Observations with llvar missing and ulvar present are left-censored for varname. Observations with llvar present and ulvar missing are right-censored for varname. Observations with llvar = ulvar are complete, and no imputation is done for them. Observations with both llvar and ulvar missing are imputed assuming an uncensored normal distribution. See Interval censoring for further information.
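
As an illustration, suppose y is observed exactly except where a hypothetical indicator below equals 1, meaning that y is known only to be at most 0.5. The bounds can then be built as sketched here. All names and the limit 0.5 are placeholders, and including y itself in mainvarlist is an assumption.

    . generate ll = y
    . generate ul = y
    . * left-censored observations: lower bound missing, upper bound present
    . replace ll = . if below==1
    . replace ul = 0.5 if below==1
    . ice y x1 x2, interval(y:ll ul) saving(imp, replace) m(5)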

initialonly imputes by random sampling from the distribution of the non-missing values of each variable which has missing value(s). This is the initialisation step of the MICE algorithm (see Remarks). This option may be used to get a 'quick and dirty' set of multiple imputations with which to explore initial impressions of the analysis model, or to investigate possible prediction equations for subsequent multiple imputation using the MICE method. The prediction equations that are displayed are the ones that would be used by default in a full MICE imputation run; with the initialonly option, they are ignored when imputations are produced.

matchpool(#) modifies the implementation of the match() option. match performs predictive mean matching in which a pool of potential matches is constructed and one member of this pool is sampled (with equal probabilities). # specifies the size of this pool. The default is 3. Please note that older versions of ice used # = 1.

monotone assumes that the members of mainvarlist have a monotone missingness pattern and defines the prediction equations accordingly. For variables x1, ..., xk the imputation equations would be x1 on [nothing], x2 on x1, x3 on x1 x2, ... , xk on x1 x2 ... x(k-1). When the missingness really is monotonic, only one cycle of MICE is required, so the default here is cycles(1). There is no advantage in specifying more than one cycle.

With the monotone option, ice reports a 'non-monotonicity score'. This is defined as 100 * (sum of numerators) / (sum of denominators), where the sums are taken over all adjacent pairs of variables in mainvarlist. Consider two variables, x1 and x2. The numerator for x1 and x2, i.e. the non-monotonicity, is the number of observations in the estimation sample for which x1 is missing and x2 is observed. If the numerator is positive, x1 and x2 show a non-monotonic pattern. The denominator for x1 and x2 is the number of observations in the estimation sample for which x2 is observed.

ice takes a relaxed view of runs in which the non-monotonicity score is positive. It warns the user but goes ahead with the imputation anyway - it assumes that the user knows what they are doing.
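
The contribution of a single pair of variables to the non-monotonicity score can be reproduced by hand. The sketch below does this for x1 and x2 (placeholder names), assuming that every observation in memory belongs to the estimation sample. With only one pair, the result equals the score itself.

    . count if missing(x1) & !missing(x2)
    . local num = r(N)
    . count if !missing(x2)
    . local den = r(N)
    . display "non-monotonicity score for x1, x2: " 100*`num'/`den'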

noshoweq suppresses the presentation of the prediction equations.

noconstant suppresses the regression constant in all regressions.

nopp suppresses treatment of the perfect prediction bug (see Avoiding the perfect prediction bug).

noverbose suppresses display of the imputation number (as #) and cycle number within imputations (as .) which show the progress of the imputations.

nowarning suppresses warning messages.

on(varlist) changes the operation of ice in a major way. With this option, uvis imputes each member of mainvarlist univariately on varlist. This provides a convenient way of producing multiple imputations when imputation for each variable in mainvarlist is to be done univariately on a set of complete predictors.

orderasis enters the variables in mainvarlist into the MICE algorithm in the order given. The default is to order them according to the number of missing values: the variable with least missingness gets imputed first, and so on.

persist causes ice to ignore errors raised by uvis when trying to impute a "difficult" variable, or impute with a model that is difficult to fit to the data to hand. Trying to impute a "difficult" variable using the ologit or mlogit command is the most common cause of failure. By default, ice stops with an error message. With persist, ice continues to the next variable to be imputed, not updating the variable that raised an error. Often, by the play of chance, the "difficult" variable is successfully updated in a subsequent cycle, and no damage is done to the imputation process.

If the error for a given variable appears in every cycle, you should consider changing the prediction equation for that variable, since its imputed values are unlikely to be appropriate.

We do not recommend the routine use of persist. Only use it when it appears that there is sporadic failure to fit an imputation model.

restrict([varname] [if]) specifies that imputation models be computed using the subsample identified by varname and if.

The subsample is defined by the observations for which varname!=0 that also meet the if conditions. Typically, varname=1 defines the subsample and varname=0 indicates observations not belonging to the subsample. For observations whose subsample status is uncertain, varname should be set to a missing value; such observations are dropped from the subsample.

By default ice fits imputation models and imputes missing values using the sample of observations identified in the [if] [in] options. The restrict() option identifies a subset of this sample to be used for model estimation. Imputation is restricted to the sample identified in the [if] [in] options. Thus, predictions and their associated imputations are made 'out-of-sample' with respect to the subsample defined by restrict().

Be careful to avoid restrictions that prevent prediction for all the relevant observations. For example, models that involve mlogit will fail to predict 'everywhere' if the restrict() option excludes any of the levels of the target variable, as in the following example. school is a four-level categorical variable coded 0, 1, 2, 3:

    . gen byte ok = (school > 0) if !missing(school)
    . ice school house age sex bcg, clear restrict(ok)

By default, school is imputed using mlogit. Predictions cannot be made for observations with school==0. ice will halt with error #303 (equation not found).

seed(#) sets the random number seed to #. In order to reproduce a set of imputations, the same random number seed should be used. See Reproducibility of results from uvis and ice for further comments. Default #: 0, meaning no seed is set by the program; depending on the status of Stata's random number seed, different sets of imputations should be obtained on each run.

substitute(sublist) is typically used with the passive() option to represent multilevel categorical variables as dummy variables in models for predicting other variables. See passive() for more details. The syntax of sublist is varname:dummyvarlist [,varname:dummyvarlist ...] where varname is the name of a variable to be substituted and dummyvarlist is the list of dummy variables representing it.

Note, however, the following important convenience feature: substitute() may be used without corresponding expressions in passive() to recreate dummy variables automatically. If the values of variables in dummyvarlist are NOT defined through expressions involving varname in the passive() option, then the variables in dummyvarlist are calculated according to the actual range of values of varname. For example, suppose the options passive(x1a:x1==2 \ x1b:x1==3) and substitute(x1:x1a x1b) were specified. Provided that all the non-missing values of x1 were 2 when x1a==1 and all the non-missing values of x1 were 3 when x1b==1, then passive(x1a:x1==2 \ x1b:x1==3) is implied by substitute(x1:x1a x1b) and can be omitted. The rule applied by substitute(x:dummy1 [dummy2...]) for defining dummy variables dummy1, dummy2, ... is as follows:

1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0.

2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise.

2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise.

3. Repeat steps 1 and 2a,b for dummy2, dummy3, ... as necessary.

With many such categorical variables this feature can save a lot of typing.
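The rule in steps 1 to 3 can be sketched with built-in Stata commands. This is an illustration of the rule as described above, not the code ice itself runs; the toy data and the pretend imputed value are assumptions:

```stata
* Toy data (hypothetical): x1 takes values 1-3; x1a and x1b are its dummies
clear
input x1 x1a x1b
1 0 0
2 1 0
3 0 1
2 1 0
. . .
end

* Pretend an imputation step has just filled in the missing x1 with 3
replace x1 = 3 in 5

foreach d of varlist x1a x1b {
    * Step 1: range of x1 over observations where the dummy is positive.
    * The !missing() guard matters because missing counts as > 0 in Stata.
    summarize x1 if `d' > 0 & !missing(`d'), meanonly
    * Steps 2a and 2b: inrange() covers both cases, since xmin = xmax
    * reduces the interval to a single point
    replace `d' = inrange(x1, r(min), r(max)) if !missing(x1)
}
list
```

In the fifth observation the recreated dummies are x1a = 0 and x1b = 1, consistent with the imputed x1 = 3.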

trace(trace_filename) monitors the convergence of the imputation algorithm. For each original variable with missing values, the mean of the imputed values is stored as a variable in trace_filename, together with the cycle number at which that mean was calculated. The results are stored only for the final imputation. For diagnostic purposes, it is sensible to run trace() with m(1) and a large number of cycles, such as cycles(100). When the run is complete, it is helpful to load trace_filename into memory and plot the mean for each imputed variable against the cycle number. If necessary, smoothing may be applied to clarify any apparent pattern. Convergence is judged to have occurred when the pattern of the imputed means is random. It is usually obvious from the appearance of the plot how many cycles are needed for convergence.
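A possible diagnostic run is sketched below on simulated data. It assumes that ice is installed. The variable names x1_mean and cycle used in the plot are assumptions about how the trace file is laid out; inspect the output of describe and substitute the names actually stored.

```stata
* Toy data (hypothetical) with missing values in x1 and x2
clear
set seed 5
set obs 300
gen x1 = rnormal()
gen x2 = x1 + rnormal()
gen x3 = x2 + rnormal()
replace x1 = . if runiform() < 0.2
replace x2 = . if runiform() < 0.2

* One imputation, many cycles, as suggested for diagnostic purposes
ice x1 x2 x3, saving(imputed_trace, replace) m(1) cycles(100) trace(mytrace) seed(101)

* Load the trace file and check which variables it contains
use mytrace, clear
describe

* ASSUMPTION: the mean for x1 is stored as x1_mean and the cycle number
* as cycle; replace these names with those shown by describe if different
line x1_mean cycle
```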

    +------+
----+ uvis +-------------------------------------------------------------

boot invokes a bootstrap method for creating imputed values (see bootstrap).

by(varlist) performs imputation separately for all combinations of variables in varlist. Observations with missing values for any members of varlist are excluded. May be combined with restrict().

gen(newvar) is not optional. newvar contains original (non-missing) and imputed (originally missing) values of yvar.

match creates imputations by prediction matching. The default is to draw imputations at random from the posterior distribution of the missing values of yvar, conditional on the observed values and the members of xvars. See match for further details.

matchpool(#) - see matchpool for details.

noconstant suppresses the regression constant in all regressions.

noverbose suppresses non-error messages while uvis is running.

replace permits newvar (see gen(newvar)) to be overwritten with new data. replace may not be abbreviated.

restrict([varname] [if]) specifies that the imputation model be computed using the subsample identified by varname and if.

By default, uvis fits the imputation model using the sample of observations identified by the [if] and [in] qualifiers. The restrict() option identifies a subset of this sample.

seed(#) sets the random number seed to #. See Reproducibility of results from uvis and ice for comments on how to ensure reproducible imputations by using the seed() option. Default #: 0, meaning no seed is set by the program.

Remarks

Algorithm used by uvis

When cmd is regress, uvis imputes yvar from xvars according to the following algorithm (see van Buuren et al (1999) section 3.2 for further technical details):

1. Estimate the vector of coefficients (beta) and the residual variance by regressing the non-missing values of yvar on the current "completed" version of xvars. Predict the fitted values etaobs at the non-missing observations of yvar.

2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation.

3. Draw at random a value (beta_star) from the posterior distribution of beta, conditional on sigma_star, thus allowing for uncertainty in beta.

4. Use beta_star to predict the fitted values etamis at the missing observations of yvar.

5. The imputed values are predicted directly from beta_star, sigma_star and the covariates. For imputation by linear regression, this step assumes that yvar is Normally distributed, given the covariates. For other types of imputation, samples are drawn from the appropriate distribution.
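Steps 1 to 5 can be sketched for linear regression using only built-in Stata commands. This is an illustration under stated assumptions, not the code uvis runs: the toy data, the variable names and the particular way of drawing sigma_star (a scaled inverse chi-squared draw on the residual degrees of freedom) are assumptions made for the example.

```stata
* Toy data (hypothetical); about a quarter of y is set to missing
clear
set seed 12345
set obs 200
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5*x1 - 0.3*x2 + rnormal()
replace y = . if runiform() < 0.25

* Step 1: fit on the observed y and keep the fitted values etaobs
regress y x1 x2
predict double etaobs if e(sample), xb
matrix b = e(b)
matrix V = e(V)
local rmse = e(rmse)
local df = e(df_r)

* Step 2: draw sigma_star from the posterior of the residual SD
local sigma_star = `rmse' * sqrt(`df' / rchi2(`df'))

* Step 3: draw beta_star given sigma_star. V already contains rmse^2,
* so the Cholesky factor is rescaled by sigma_star/rmse.
local k = colsof(b)
matrix z = J(`k', 1, 0)
forvalues i = 1/`k' {
    matrix z[`i', 1] = rnormal()
}
local ratio = `sigma_star' / `rmse'
matrix bstar = b + `ratio' * (cholesky(V) * z)'
local names : colfullnames b
matrix colnames bstar = `names'

* Step 4: linear predictor at the missing observations of y
matrix score double etamis = bstar if missing(y)

* Step 5: add Normal noise with standard deviation sigma_star
gen double ym = y
replace ym = etamis + `sigma_star' * rnormal() if missing(y)
summarize y ym
```

The variable ym plays the role of newvar in gen(newvar): observed values are copied unchanged and missing values are replaced by draws.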

With the match option, step 5 is replaced by the following. For each missing observation of yvar with prediction etamis, find the non-missing observation of yvar whose prediction (etaobs) on observed data is closest to etamis. This closest non-missing observation is used to impute the missing value of yvar.
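The matching step can be sketched as a nearest-neighbour search on the linear predictor. The example below is a simplified illustration on simulated data: for brevity etamis is computed from the point estimates rather than from a drawn beta_star, and the closest single donor is always used, so it should not be read as the uvis implementation.

```stata
* Toy data (hypothetical): y is skew and positive
clear
set seed 2718
set obs 100
gen x1 = rnormal()
gen y = exp(0.5*x1 + 0.5*rnormal())
replace y = . if runiform() < 0.2

regress y x1
predict double etaobs if e(sample), xb
* Simplification: point estimates are used here instead of beta_star
predict double etamis if missing(y), xb

gen long obsno = _n
gen double ym = y
forvalues i = 1/`=_N' {
    if missing(y[`i']) {
        quietly {
            * Distance is missing for observations without an observed y,
            * so they sort to the end and cannot be chosen as donors
            gen double dist = abs(etaobs - etamis[`i'])
            sort dist obsno
            replace ym = y[1] if obsno == `i'
            sort obsno
            drop dist
        }
    }
}
summarize y ym
```

Because every imputed value is borrowed from an observed value of y, all imputations in this example are positive, which a Normal draw would not guarantee.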

The default draw method is not robust to departures from Normality and may produce implausible imputations. For example, if the original distribution is skew and positive-valued, the imputed distribution will not necessarily have the appropriate amount of skewness, nor will all the imputed values necessarily be positive. Log transformation of positive variables may greatly improve the appropriateness of the imputations.

The alternative match method is recommended only for continuous variables when the Normality assumption is clearly untenable, even approximately. It is not necessary, nor is it implemented, for binary, ordered categorical or nominal variables. match may work well when the distribution of a continuous variable is very non-Normal, but it may sometimes result in biased imputations.

With the boot option, steps 2-4 are replaced by a bootstrap estimation of beta_star and sigma_star, obtained by regressing yvar on xvars after taking a bootstrap sample of the non-missing observations. This has the advantage of robustness since the distribution of beta is no longer assumed to be multivariate normal.
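The bootstrap variant can be sketched as follows, again with built-in commands only. The toy data are an assumption, and the sketch shows the idea described above rather than the code used by uvis.

```stata
* Toy data (hypothetical); about a quarter of y is set to missing
clear
set seed 12345
set obs 200
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5*x1 - 0.3*x2 + rnormal()
replace y = . if runiform() < 0.25

* Estimate beta_star and sigma_star in a bootstrap sample of the
* non-missing observations; the full data are restored afterwards
preserve
keep if !missing(y)
bsample
regress y x1 x2
restore

* The estimation results survive the restore, so predict applies the
* bootstrap coefficients to every observation in the full data
predict double eta, xb
local sigma_star = e(rmse)

* Imputed values: bootstrap prediction plus Normal noise
gen double ym = cond(missing(y), eta + `sigma_star' * rnormal(), y)
summarize y ym
```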

Note that uvis will not impute observations for which a value of a variable in xvars is missing. However, all original (missing or non-missing) observations of yvar will be copied into newvar in such cases. This is a change from the first release of uvis (with mvis). Previously, newvar would be set to missing whenever a value of a variable in xvars was missing, irrespective of the value of yvar.

Missing data for ordered (or unordered) categorical covariates should be imputed by using the ologit (or mlogit) command. match is neither required nor implemented in these cases.

ice carries out multivariate imputation in mainvarlist using regression switching (van Buuren et al 1999) as follows:

1. Ignore any observations for which mainvarlist has only missing values, or if the cc(varlist) option has been specified, for which any member of varlist has a missing value.

2. For each variable in mainvarlist with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initialises the iterative procedure by ensuring that no relevant values are missing.

3. For each variable in mainvarlist in turn, impute missing values by applying uvis with the remaining variables as covariates.

4. Repeat step 3 cycles() times, replacing the imputed values with updated values at the end of each cycle.

A single imputation sample is created for each variable with any relevant missing values.
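The regression-switching cycle in steps 1 to 4 can be sketched for continuous variables as below. This is a deliberately simplified illustration on simulated data: each update uses the point estimates plus Normal noise instead of calling uvis (so the uncertainty in the coefficients is ignored), and observations with all values missing are dropped rather than ignored.

```stata
* Toy data (hypothetical) with missing values in all three variables
clear
set seed 4321
set obs 300
gen x1 = rnormal()
gen x2 = 0.6*x1 + rnormal()
gen x3 = 0.4*x1 - 0.5*x2 + rnormal()
foreach v of varlist x1 x2 x3 {
    replace `v' = . if runiform() < 0.15
}
local vars x1 x2 x3

* Step 1: leave out observations with nothing observed
drop if missing(x1) & missing(x2) & missing(x3)

* Step 2: put the observed values in random order and replicate them
* across the missing cases
foreach v of local vars {
    gen byte m_`v' = missing(`v')
    gen double u = runiform()
    sort m_`v' u
    quietly count if !m_`v'
    local nobs = r(N)
    quietly replace `v' = `v'[1 + mod(_n - 1, `nobs')] if m_`v'
    drop u
}

* Steps 3 and 4: update each variable in turn, for 10 cycles
forvalues c = 1/10 {
    foreach v of local vars {
        local others : list vars - v
        quietly regress `v' `others' if !m_`v'
        quietly predict double eta, xb
        quietly replace `v' = eta + e(rmse) * rnormal() if m_`v'
        drop eta
    }
}
summarize x1 x2 x3
```

The indicators m_x1, m_x2 and m_x3 record which values were originally missing, so that only those values are updated in each cycle.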

Van Buuren recommends cycles(20) but goes on to say that 10 or even 5 iterations are probably sufficient. We have chosen a compromise default of 10.

"Multiple imputation" (MI) implies the creation and analysis of several imputed datasets. To do this, one would run ice with m set to a suitable number, for example 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the m separate imputations. A suitable estimation tool for this purpose is mim.

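A typical workflow might look like the following sketch, which assumes that both ice and mim are installed. The simulated data, variable names and filename are hypothetical.

```stata
* Toy data (hypothetical): complete outcome y, incomplete covariates
clear
set seed 99
set obs 300
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + x1 + 0.5*x2 + rnormal()
replace x1 = . if runiform() < 0.2
replace x2 = . if runiform() < 0.2

* Create m = 5 imputations and store them in imputed.dta
ice y x1 x2, saving(imputed, replace) m(5) seed(101)

* Fit the analysis model in each imputation and combine the results
use imputed, clear
mim: regress y x1 x2
```

Note that the outcome y is included in the imputation model, as recommended under Handling the outcome variable.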
Handling the outcome variable

To avoid bias, the outcome variable must always be included in the list of variables to be used for imputation. In survival analysis, in particular, it is essential to include the censoring indicator as well as the survival time. van Buuren et al (1999) recommend a log transformation of the survival time, apparently a heuristic choice. We have shown (White & Royston 2009) that for a single binary predictor and a proportional hazards analysis model, the correct imputation model comprises the baseline cumulative hazard, the censoring indicator and the binary predictor. The theory remains approximately valid for a normally distributed predictor with a weak effect. More complex cases have not yet been investigated, but at least some guidance is now available.
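One way to follow this advice in practice is to use the Nelson-Aalen estimate of the cumulative hazard as a predictor in the imputation model. The sketch below assumes that ice is installed; the simulated data and variable names are hypothetical, and the use of the Nelson-Aalen estimate as a stand-in for the baseline cumulative hazard is an approximation chosen for the example.

```stata
* Toy survival data (hypothetical): binary x1 with missing values,
* follow-up censored at time 2
clear
set seed 7
set obs 500
gen byte x1 = runiform() < 0.5
gen x2 = rnormal()
gen double t = -ln(runiform()) / exp(0.7*x1)
gen byte died = t < 2
replace t = 2 if !died
replace x1 = . if runiform() < 0.2

* Nelson-Aalen estimate of the cumulative hazard at each observed time
stset t, failure(died)
sts generate H = na

* Impute x1 from the censoring indicator and the cumulative hazard,
* together with the other covariate
ice x1 x2 died H, saving(imputed_surv, replace) m(5) seed(101)
```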

Handling binary variables

Binary variables present no difficulty. By default, in the MICE procedure, when such a variable is the response, it is predicted from other variables by using logistic regression; when it is a covariate, it is modelled in the only way possible, effectively as a single dummy variable.

Ensure that binary variables are coded 0/1. Although, in theory, one could use ologit or mlogit to model them, in practice there is no advantage in doing so. Furthermore, do not use the i. prefix with binary variables, since there is a speed penalty in doing so.

Handling categorical variables

Categorical variables with 3 or more levels may in principle be treated in different ways. By default in ice, variables with 3-5 levels are modelled using multinomial logistic regression (mlogit command) when they are the response, and enter as a single linear term when they are a covariate. The same behaviour occurs with the ordered logistic model (ologit command). Our recommended strategy is to use the m. or o. prefixes for variables to be imputed using unordered or ordered logistic regression. This approach removes the need to define the substitute() and passive() options, both of which can be tedious and error-prone to type.

You should be aware that unless the dataset is large, use of the mlogit command may produce unstable estimates if the number of levels is too large, and may compromise the accuracy of the imputations. It is hard to predict when this will occur.

Interval censoring

Values of a variable y that are interval censored are imputed under the assumption that y is normally distributed with unknown mean and variance. The method, which is fast and efficient, is essentially as described for right-censored variables in section 3.3 of Royston (2001). A minor extension to allow left or interval censoring is employed. For example, if A < y < B and A and B are both finite, the imputed value for y will follow a truncated normal distribution with bounds A and B, variance parameter estimated from the data and mean given by the linear predictor for the imputation model for y. Stata's intreg command is used to estimate the mean and variance of y. When A and B are both missing (infinite), imputation of y simply assumes the normal distribution just mentioned, but without bounds.

If you wish to impose range limits on the imputed values, the lower and upper bound variables may be set accordingly. For example, to impute right-censored (e.g. survival) data, you would set llvar equal to all the observed times to event, whether censored or not, and ulvar to all the uncensored event times and missing for the censored times. This would cause the right-censored values to be imputed without restriction. If you wanted to bound the imputed values above, say by 10, you would specify ulvar to be 10 (rather than missing) for all the censored observations.
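The set-up for right-censored data described above can be sketched as follows, assuming that uvis is installed. The simulated data and the names t, died, ll and ul are hypothetical; the toy times are censored at 5 so that the upper bound of 10 lies above every lower bound.

```stata
* Toy data (hypothetical): Normally distributed times, censored at 5
clear
set seed 11
set obs 300
gen x1 = rnormal()
gen x2 = rnormal()
gen double t = 4 + x1 + 0.5*x2 + rnormal()
gen byte died = t < 5
replace t = 5 if !died

* Lower bound: every observed time, censored or not.
* Upper bound: the event time if uncensored, missing if censored.
gen double ll = t
gen double ul = t if died
uvis intreg ll ul x1 x2, gen(timp) seed(101)

* To bound the imputed values above by 10, set ul to 10 (rather than
* missing) for the censored observations
replace ul = 10 if !died
uvis intreg ll ul x1 x2, gen(timp10) seed(101)
summarize timp timp10 if !died
```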

Avoiding the perfect prediction bug

Perfect prediction may arise in logistic, ologit or mlogit regression models when a (usually categorical) predictor variable perfectly predicts success or failure in the outcome variable. In ice, perfect prediction may occur without the user's knowledge because a large number of regression models are run silently. Perfect prediction may lead to entirely inappropriate imputations. To avoid this, uvis checks for perfect prediction; if it is detected, uvis temporarily augments the data with a small number of extra observations with low weight, in such a way as to remove the perfect prediction. A message is displayed noting the variable that has the perfect prediction issue, and that the problem has been dealt with. Such treatment of the perfect prediction bug may be switched off, if desired, by using the nopp option.

Errors and diagnostics

ice may occasionally detect an anomaly when running uvis with a particular variable as response and a particular regression command. ice will then stop and report the uvis command it was running and the error number returned. It also saves a snapshot of the data it was using when the error occurred to a file called _ice_dump.dta in the working directory. Sometimes the problem lies in a regression of a binary or categorical variable where the estimation procedure fails to converge; this is usually caused by sparse cell occupancy of the response variable. If you obtain this error, you should either omit the offending variable from the imputation or seek to combine a sparse category with another category.

Another possibility is that, again due to a defect in a particular regression command in the chained equations structure, the number of values imputed for a particular variable is less than expected. This is a serious error and again may arise from estimation problems involving a binary or categorical variable. In this situation, ice again saves to a file called _ice_dump.dta in the working directory a snapshot of the data it was using in the attempted estimation, while reporting the uvis command it was executing. You can then investigate what may have gone wrong with the command by loading the data in _ice_dump.dta and re-running the offending regression command.

Reproducibility of results from uvis and ice

Use of the option seed(#) ensures that a set of imputed values is reproduced identically for a given value of #. This is true for both uvis and ice.

Please report to the author any instances where use of ice or uvis with a fixed seed does not produce the same set of imputed values.

Pitfalls in using the i. prefix

ice commands that include i.varname in mainvarlist need to be handled with awareness. If varname has no missing data in the estimation sample, expected results are obtained. If varname does have missing values in the estimation sample, an error message is given and ice stops. The "estimation sample" here is the set of observations for which at least one variable in mainvarlist has non-missing value(s).

The presence of i. evokes xi, which expands i.varname in the usual way to create _Ivarname_# dummy variables. Since varname has no missing data, the dummy variables are included in the prediction equations for other variables in mainvarlist, as required.

If i.varname were allowed to have missing data in the estimation sample, xi expansion would occur as before, but each of the _Ivarname_# dummy variables would become a response variable in a prediction equation and would be predicted individually (using logistic regression). Worse, the prediction equation for each dummy variable would include the other dummy variables from i.varname. That is clearly nonsense.

The advice, as always, is (a) to use dryrun before 'production' runs if the ice command is at all complex, and then (b) carefully to check that ice's table of prediction equations is both sensible and what you expected.
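A check along these lines might look as follows, assuming that ice is installed. The simulated data are hypothetical; x3 is a three-level variable with no missing values, as the i. prefix requires.

```stata
* Toy data (hypothetical): x3 is complete, x1 and x2 are not
clear
set seed 31
set obs 300
gen byte x3 = 1 + floor(3 * runiform())
gen x1 = rnormal() + 0.3*x3
gen x2 = rnormal() + 0.5*x1
replace x1 = . if runiform() < 0.2
replace x2 = . if runiform() < 0.2

* Confirm that the i. variable has no missing values
count if missing(x3)

* Display the prediction equations without doing any imputation
ice x1 x2 i.x3, dryrun
```

The table of prediction equations should show the dummy variables for x3 only as predictors of x1 and x2, never as responses.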

Further notes

ice saves all the variables in the current data to the output, whether or not they are involved in the imputation procedure. This can make the resulting dataset very large. It may therefore be sensible to drop variables not subsequently needed for modelling before running ice.

ice determines the order of imputing variables in the cycle of chained equations according to the amount of missing data. Variables with the least missingness are imputed first. Variables with the same amount of missingness are processed in an arbitrary order, but always in the same order. Note that if ice is run twice using identical variables (at least two of which have the same amount of missingness) and the same random number seed, but with the variables with equal missingness in a different order, slightly different imputations will be generated. The differences will be purely random and will not produce bias in subsequent parameter estimates. If the boot() option is applied to all variables, the order of variables no longer affects the results.

An important application of MI is to investigate possible models, for example prognostic models, in which selection of influential variables is required (Clark & Altman 2003). For example, the stability of the final model across the imputation samples is of interest. This area of enquiry is in its infancy.

See also Van Buuren's website http://www.multiple-imputation.com for further information and software sources.

Examples

. uvis regress y x1 x2 x3, gen(ym)

. uvis logit y x1 x2 x3, gen(y) by(x4) restrict(x5) replace noverbose

. uvis intreg ll ul x1 x2 x3, gen(y)

. ice x1 x2 x3, saving(imputed) m(5)

. ice x1 x2 x3, dropmissing monotone clear m(5)

. ice x1 x2 i.x3, clear m(5) [Note that x3 must have no missing values in the estimation sample]

. ice x1 x2 x3, saving(imputed) m(5) cycles(20) cc(x4 x5)

. ice m.x1 m.x2 o.x3 x4 x5, saving(imputed) m(10) boot(x1 x2 x3) match(x4 x5) id(pid) seed(101) genmiss(M_)

. gen x23 = x2 * x3
. ice o.x1 x2 x3 x23 z1 z2, saving(imputed) m(5) passive(x23:x2*x3) conditional(z1: if z2==0)

. ice y1 y2 y3 x1 x2 x3 x4, saving(imputed) m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3)

. ice y1 y2 y3 x1 x2 o.x3 i.x4, saving(imputed) m(5) stepwise swopts(pe(.10) pr(.15) group(x1 x2, y1 i.x4) lock(y2 x3)) match(x3)

. ice x1-x99, clear debug m(1) cycles(100)

. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) eqdrop(x2:x3, x1:x2)

. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) match(x2) dropmissing

. ice x1 ll2 ul2 x2 ll3 ul3 x3, saving(imputed) m(5) interval(x2:ll2 ul2, x3:ll3 ul3)

Author

Patrick Royston, MRC Clinical Trials Unit, London. pr@ctu.mrc.ac.uk

Further reading

van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694. Also see http://www.multiple-imputation.com.

Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3):226-244.

Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. Journal of Clinical Epidemiology 56:28-37.

Royston P. 2001. The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. Statistica Neerlandica 55:89-104.

Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3):227-241.

Royston P. 2005a. Multiple imputation of missing values: update. Stata Journal 5: 188-201.

Royston P. 2005b. Multiple imputation of missing values: update of ice. Stata Journal 5: 527-536.

Royston P. 2007. Multiple imputation of missing values: further update of ice, with an emphasis on interval censoring. Stata Journal 7: 445-464.

White I. R. and P. Royston. 2009. Imputing missing covariate values for the Cox model. Statistics in Medicine 28: 1982-1998.

White I. R., R. Daniel and P. Royston. 2010. Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables. Computational Statistics and Data Analysis 54: 2267-2275.

Acknowledgements

Ian White has made substantial contributions to the understanding and practical use of multiple imputation, and to the programming of ice and uvis. Ian wrote the guts of the draw() option; the idea and code for coping with perfect prediction are essentially all his. I am extremely grateful to him for his ongoing commitment to this project.

I am grateful also to Gillian Raab for pointing out certain issues with the prediction matching approach, particularly that it is only useful with continuous variables. As a result, the default imputation method has been changed from matching to drawing from the predictive distribution. Gillian also suggested imputing the variables in reverse order of the amount of missingness, and selecting the imputed value at random from the set determined by the available matching predictions. Both suggestions have been implemented.

Also see

On-line: help for mim (if installed), mi ice (if installed, Stata 11