{smcl} {* 18feb2014}{...} {cmd:help for ice, uvis}{right:Patrick Royston} {hline} {title:Title} {p2colset 5 12 14 2}{...} {p2col :{hi:ice} {hline 2}}Multiple imputation by the MICE system of chained equations{p_end} {p2colreset}{...} {title:Syntax} {phang2} {cmd:ice} [{it:mainvarlist}] {ifin} {weight} [{cmd:,} {it:major_options less_used_options}] {phang2} {cmd:uvis} {it:cmd} {{it:yvar}|{it:llvar ulvar}} {it:xvars} {ifin} {weight} [{cmd:,} {it:options}] {synoptset 28 tabbed}{...} {synopthdr} {synoptline} {syntab :{cmd:ice} {it:major_options}} {synopt :{opt clear}}clears the original data from memory and loads the imputed dataset into memory{p_end} {synopt :{opt dry:run}}reports the prediction equations - no imputations are done{p_end} {synopt :{opt eq(eqlist)}}defines customised prediction equations{p_end} {synopt :{opt m(#)}}defines the number of imputations{p_end} {synopt :{opt ma:tch(varlist)}}predictive mean matching for each member of {it:varlist}{p_end} {synopt :{opt pass:ive(passivelist)}}passive imputation{p_end} {synopt :{cmdab:sav:ing(}{it:filename} [{opt ,replace}]{cmd:)}}imputed and non-imputed variables are stored to {it:filename}{p_end} {synopt :{opt stepwise}}constructs prediction equations by stepwise variable selection{p_end} {synopt :{opt sw:opts(stepwise_options)}}options for {opt stepwise}{p_end} {syntab :{cmd:ice} {it:stepwise_options}} {synopt :{opt forward}}perform forward-stepwise selection{p_end} {synopt :{opt gr:oup(group_list)}}create groups of variables for joint testing for addition or removal{p_end} {synopt :{opt lo:ck(varlist)}}Variables to be kept in all models{p_end} {synopt :{opt pe(#)}}significance level for addition to a model{p_end} {synopt :{opt pr(#)}}significance level for removal from a model{p_end} {synopt :{opt sh:ow}}show each stepwise regression{p_end} {syntab :{cmd:ice} {it:less_used_options}} {synopt :{opt allm:issing}}imputes in observations with all values in {it:mainvarlist} missing{p_end} {synopt :{opt bo:ot(varlist)}}estimates regression coefficients for {it:varlist} in a bootstrap sample{p_end} {synopt :{opt by(varlist)}}imputation within the levels implied by {it:varlist}{p_end} {synopt :{opt cc(varlist)}}prevents imputation of missing data in observations in which {it:varlist} has a missing value{p_end} {synopt :{opt cm:d(cmdlist)}}defines regression command(s) to be used for imputation{p_end} {synopt :{opt cond:itional(condlist)}}conditional imputation{p_end} {synopt :{opt cy:cles(#)}}determines number of cycles of regression switching{p_end} {synopt :{opt de:bug}}assistance to debug individual regressions{p_end} {synopt :{opt drop:missing}}omits from the output all observations not in the estimation sample{p_end} {synopt :{opt eqd:rop(eqdroplist)}}removes variables from prediction equations{p_end} {synopt :{opt g:enmiss(string)}}creates missingness indicator variable(s){p_end} {synopt :{opt i:d(varname)}}creates {it:varname} containing the original sort order of the data{p_end} {synopt :{opt init:ialonly}}impute by random sampling from distribution of non-missing values{p_end} {synopt :{opt int:erval(intlist)}}imputes interval-censored variables{p_end} {synopt :{opt matchp:ool(#)}}size of pool of potential matches for predictive mean matching{p_end} {synopt :{opt mono:tone}}assumes pattern of missingness is monotone, and creates relevant prediction equations{p_end} {synopt :{opt nocons:tant}}suppresses the regression constant{p_end} {synopt :{opt nopp}}suppresses special treatment of perfect prediction{p_end} {synopt :{opt nosh:oweq}}suppresses presentation of prediction equations{p_end} {synopt :{opt nover:bose}}suppresses messages showing the progress of the imputations{p_end} {synopt :{opt nowarn:ing}}suppresses warning messages{p_end} {synopt :{opt on(varlist)}}imputes each member of {it:mainvarlist} univariately{p_end} {synopt :{opt ord:erasis}}enters the variables in the order given{p_end} {synopt :{opt per:sist}}ignore errors when trying to impute "difficult" variables and/or models{p_end} {synopt :{cmdab:res:trict(}[{varname}] [{it:{help if}}]{cmd:)}}fit models on a specified subsample, impute missing data for entire estimation sample{p_end} {synopt :{opt se:ed(#)}}sets random number seed{p_end} {synopt :{opt sub:stitute(sublist)}}substitutes dummy variables for multilevel categorical variables{p_end} {synopt :{opt tr:ace(trace_filename)}}monitors convergence of the imputation algorithm{p_end} {syntab :{cmd:uvis} {it:options}} {synopt :{opt g:en(newvarname)}}creates variable containing imputations. {opt Not optional}{p_end} {synopt :{opt bo:ot}}estimates regression coefficients in a bootstrap sample{p_end} {synopt :{opt by(varlist)}}imputation within the levels implied by {it:varlist}{p_end} {synopt :{opt lrd}}imputes using local residual draws{p_end} {synopt :{opt ma:tch}}does predictive mean matching{p_end} {synopt :{opt matchp:ool(#)}}size of pool of potential matches for predictive mean matching{p_end} {synopt :{opt matchtype(#)}}sets the method for identifying closest matches{p_end} {synopt :{opt nopp}}suppresses special treatment of perfect prediction{p_end} {synopt :{opt nover:bose}}suppresses information about the imputation process{p_end} {synopt :{opt replace}}overwrites {it:newvarname} if it exists{p_end} {synopt :{cmdab:res:trict(}[{varname}] [{it:{help if}}]{cmd:)}}fit models on a specified subsample, impute missing data for entire estimation sample{p_end} {synopt :{opt se:ed(#)}}sets random number seed{p_end} {synoptline} {p2colreset}{...} {pstd} where {it:cmd} (with {opt uvis}) may be {help intreg}, {help logistic}, {help logit}, {help mlogit}, {help nbreg}, {help ologit}, or {help regress}. {it:llvar} {it:ulvar} are required with {cmd:intreg}. {pstd} An element of {it:mainvarlist} for {cmd:ice} takes one of two forms: {it:varname} or [{hi:i.}|{hi:m.}|{hi:o.}]{it:varname}. Details are given in {help ice##special:Special features for imputing categorical variables}. If {it:mainvarlist} is omitted, variables and chained equations are input from special global macros; see the {cmd:eq()} and {opt stepwise} options for details. {pstd} All weight-types are supported. {pstd} {bf:{ul:Stata 11 users:}} Please see {help mi ice}, which does all that {cmd:ice} does and a little bit more, and is conveniently integrated into the new {help mi} system. {title:Description} {pstd} {cmd:ice} imputes missing values in {it:mainvarlist} by using switching regression, an iterative multivariable regression technique. The abbreviation MICE means multiple imputation by chained equations, and was apparently coined by Stef van Buuren. {cmd:ice} implements MICE for Stata. Sets of imputed and non-imputed variables are stored to a new file called {it:filename}. Any number of complete imputations may be created. The original data are stored in {it:filename} as "imputation number 0" and the new variable {cmd:_mj} is set to 0 for these observations. {pstd} {cmd:uvis} ({cmd:u}ni{cmd:v}ariate {cmd:i}mputation {cmd:s}ampling) imputes missing values in the single variable {it:yvar} based on multiple regression on {it:xvars}. {cmd:uvis} is called repeatedly by {cmd:ice} in a regression switching mode to perform multivariate imputation. {pstd} The missing observations are assumed to be "missing at random" (MAR) or "missing completely at random" (MCAR), according to the jargon. See for example van Buuren {it:et al} (1999) for an explanation of these concepts. {pstd} Please note that {cmd:ice} and {cmd:uvis} require Stata 8.0 or higher. There have been incompatibility issues with Stata 7 or lower. {pstd} {marker special}{...} {ul:{hi:Special features for imputing categorical variables}} {pstd} The prefixes {hi:i.}, {hi:m.} and {hi:o.} for a variable in {cmd:ice}'s {it:mainvarlist} are a convenience feature designed to simplify specification of the imputation model for categorical variables with three or more levels. You should hardly ever need to use Stata's {cmd:xi} dummy variable and interaction creator directly with {cmd:ice} commands, since dummy variables and more are adequately handled by using the {hi:i.}, {hi:m.} and {hi:o.} prefixes. {pstd} The prefix {hi:i.} in {hi:i.}{it:varname} may be used only when {it:varname} has no missing data. It applies {cmd:xi} to {hi:i.}{it:varname} to create the corresponding dummy variables. If {it:varname} has missing data, imputation is required; either the {hi:m.} or the {hi:o.} prefix (see below) should be used with such variables. See {help ice##pitfalls:Pitfalls in using the {hi:i.} prefix} for further information. {pstd} Use of {hi:m.}{it:varname} or {hi:o.}{it:varname} substitutes {hi:i.} for {hi:m.} or {hi:o.} and applies {cmd:xi:} to {hi:i.}{it:varname}, at the same time telling {cmd:ice} to impute missing values of {it:varname} using the {cmd:mlogit} or {cmd:ologit} commands, respectively. Use of the {hi:m.} or {hi:o.} prefixes also ensures that the corresponding dummy variables are used as predictors in imputation models for other variables (see {help ice##substitute:substitute()}) and are 'passively' imputed (see {help ice##passive:passive()}). Suppose that {hi:x} is a multilevel categorical variable. Then {cmd:ice o.x}{it: varlist}{cmd:,}{it: options} is expanded to {cmd:xi: ice x i.x}{it: varlist}{cmd:, substitute(x:i.x) cmd(x:ologit)}{it: options}. Similary, {cmd:ice m.x}{it: varlist}{cmd:,}{it: options} is expanded to {cmd:xi: ice x i.x}{it: varlist}{cmd:, substitute(x:i.x) cmd(x:mlogit)}{it: options}. {pstd} The resulting 'expanded' version of the {cmd:ice} command is stored in the {cmd:$F9} global macro. It can be retrieved if desired by pressing the F9 key. {pstd} Note that the {hi:i.}, {hi:m.} and {hi:o.} prefixes are also valid with binary variables, although much less likely to be useful since one would not wish to impute a binary variable using either {cmd:mlogit} or {cmd:ologit}. {title:Options} {dlgtab:ice (major options)} {phang} {opt clear} clears the original data from memory and loads the imputed dataset. Unless the {opt saving()} option is also specified, the data in memory are not permanently saved; this must then be done manually using the {help save} or {help saveold} commands. {phang} {opt dryrun} causes {cmd:ice} to report the prediction equations it has constructed from the various inputs, but no imputations are done and no files are created. The option name ("dryrun") may be abbreviated as {opt dry}. It is not mandatory to specify an output file with {opt saving(filename)} for a dry run. Sometimes the prediction equation set-up needs to be carefully checked before running what may be a lengthy imputation process. Note that stepwise selection of prediction equations ({opt stepwise} option) still works when {opt dryrun} has been specified. {phang} {marker eq}{opt eq(eqlist)} allows one to define prediction equations for any subset of variables in {it:mainvarlist}. The {opt eq()} option, particularly when used with {cmd:passive()}, allows great flexibility in the possible imputation schemes. Note that {cmd:eq()} takes precedence over all default definitions and assumptions about the way a given variable in {it:mainvarlist} is to be imputed. If the {cmd:passive()} and {cmd:substitute()} options are not invoked, the default set of equations is that each variable in {it:mainvarlist} with any missing data is imputed from all other variables in {it:mainvarlist}. {pmore} When {opt eq()} is specified, the syntax of {it:eqlist} is {it:varname1}{cmd::}{it:varlist1} [{cmd:,}{it:varname2}{cmd::}{it:varlist2} ...] where each {it:varname#} (or {it:varlist#}) is a member (or subset) of {it:mainvarlist}. Variable names prefixed by {cmd:i.} are allowed, provided that the names were prefixed by {hi:i.}, {hi:m.} or {hi:o.} in {it:mainvarlist}. They are translated to the corresponding dummy variables created by {cmd:xi:}. {pmore} A 'blank' (null, constant-only) equation is specified as {cmd:_cons}, for example, {cmd:eq(x4 x5:_cons)}. Such equations are reported in the table of prediction equations as "{cmd:[Empty equation]}". The prediction model for variables with empty equations is simply {cmd:_cons}. {pmore} If {it:mainvarlist} is omitted, {cmd:ice} takes {it:mainvarlist} from the global macro {cmd:$ice_main} and the equations, regression commands and predicted variables from global macros {cmd:$ice_eq}{it:#}, {cmd:$ice_cmd}{it:#} and {cmd:$ice_x}{it:#}, respectively, for {it:#} = 1, ..., {cmd:$ice_neq}. The number of equations is stored in {cmd:$ice_neq}. These macros are created automatically when {cmd:ice}'s {opt stepwise} option is used (see details under {opt stepwise}). They may also be user-defined. The macros may be inspected in Stata by using the command {cmd:macro list ice_*}. {phang} {opt m(#)} defines {it:#} as the number of imputations required (minimum 1, no upper limit). The default {it:#} is 1. {phang} {marker match}{cmd:match}[{cmd:(}{it:varlist}{cmd:)}] instructs that each member of {it:varlist} be imputed with the {cmd:match} option of {cmd:uvis}. This provides predictive mean matching for each member of {it:varlist}. If {cmd:(}{it:varlist}{cmd:)} is omitted then all relevant variables are imputed with the {cmd:match} option of {cmd:uvis}. The default, if {cmd:match()} is not specified, is to draw from the posterior predictive distribution of each variable requiring imputation. {marker passive}{...} {phang} {opt passive(passivelist)} allows the use of "passive" imputation of variables that depend on other variables, some of which are imputed. The syntax of {it:passivelist} is {it:varname}{cmd::}{it:exp} [{cmd:\}{it:varname}{cmd::}{it:exp} ...]. Notice the requirement to use "\" as a separator between items in {it:passivelist}, rather than the usual comma; the reason is that a comma may be a valid part of an expression. The option is most easily explained by example. Suppose x1 is a categorical variable with 3 levels, and that two dummy variables x1a, x1b have been created by the commands {pin} {cmd:. generate byte x1a=(x1==2)}{break} {cmd:. generate byte x1b=(x1==3)} {pin} Now suppose that x1 is to be imputed by the {cmd:mlogit} command, and is to be treated as the two dummy variables x1a and x1b when predicting other variables. Use of {cmd:mlogit} is achieved by the option {cmd:cmd(x1:mlogit)}. When x1 is imputed, we want x1a and x1b to be updated with new values which depend on the imputed values of x1. This may be achieved by specifying {cmd:passive(x1a:x1==2 \ x1b:x1==3)}. It is necessary also to remove x1 from the list of predictors when variables other than x1 are being imputed, and this is done by using the {cmd:substitute()} option; in the present example, you would specify {cmd:substitute(x1:x1a x1b)}. {pin} Note that although in this example x1a will take the (possibly unintended) value of 0 when x1 is missing, {cmd:ice} is careful to ensure that x1a (and x1b) inherit the missingness of x1, and are passively imputed following active imputation of missing values of x1. If this were not done, incorrect results could occur. The responsibility of the user is to create x1a and x1b before running {cmd:ice} such that their missing values are identical to those of x1. {pin} A second example is multiplicative interactions between variables, for example, between x1 and x2 (e.g. x12=x1*x2); this could be entered as {cmd:passive(x12:x1*x2)}. It would cause the interaction term x12 to be omitted when either x1 or x2 was being imputed, since it would make no sense to impute x1 from its interaction with x2. {cmd:substitute()} is not needed here. {pin} It should be stressed that variables to be imputed passively must already exist and must be included in {it:mainvarlist}, otherwise they are not recognised. Passive variables may be defined in terms of variables in {it:mainvarlist} and variables not in {it:mainvarlist}, although it would of course make no sense not to involve at least one variable in {it:mainvarlist}. {phang} {cmd:saving(}{it:filename} [{cmd:,replace}]{cmd:)} saves the imputation to {it:filename}. {opt replace} allows {it:filename} to be overwritten with new data. {cmd:replace} may not be abbreviated. {phang} {opt stepwise} constructs prediction equations by stepwise variable selection among members of {it:mainvarlist}. There are 3 steps to the process. First, {cmd:ice} creates a dataset with 1 imputation using a randomly drawn subset of values from the distribution of each variable with missing values. (This is the standard initialisation step for {cmd:ice}, and is invoked automatically by the {opt initialonly} option.) Next, {cmd:ice} runs {helpb stepwise} to select variables for each prediction equation. Binary dummy variables are treated appropriately. By default, forward selection at a 5% significance level is used; see the {opt swopts()} option for other possibilities. Finally, {cmd:ice} retrieves the reduced equations and performs imputation with them as usual. {pmore} Using {opt stepwise} also causes {cmd:ice} to store {it:mainvarlist}, the selected equations, variables and commands in global macros called {cmd:$ice_*}, as described under the {opt eq()} option. {phang} {opt swopts(stepwise_options)} allows the following {it:stepwise_options} for use with {cmd:stepwise}: {opt forward}, {opt group(group_list)}, {opt lock(varlist)}, {opt pe(#)}, {opt pr(#)} and {opt show}. Note that only {opt pe(#)}, {opt pr(#)} and {opt forward} are standard options of Stata's {cmd:stepwise} command; the remainder are used to group variables for joint testing for inclusion or exclusion from the models, to construct a list of variables formatted for use with {cmd:stepwise}'s {opt lockterm1} option, and to show the output from {cmd:stepwise}. Further details of individual options are given below under {cmd:ice} ({it:stepwise options}). {pmore} Specifying neither {opt pe(#)} nor {opt pr(#)} is equivalent to specifying {cmd:pe(0.05)}, i.e. the default method is forward selection of variables significant at the 5% level. {pmore} Note that variables in {it:mainvarlist} that have the prefix {cmd:i.}, indicating that they are categorical, are to be represented by their dummy variables and have no missing data, should retain their {cmd:i.} prefix when they are included in the {opt group()} or {opt lock()} options. {dlgtab:ice (stepwise options)} {phang} {opt forward} specifies the forward-stepwise method and may be specified only when both {opt pr()} and {opt pe()} are also specified. Specifying both {opt pr()} and {opt pe()} without {opt forward} results in backward-stepwise selection. Specifying only {opt pr()} results in backward selection, and specifying only {opt pe()} results in forward selection. {phang} {opt group(group_list)} specifies variables always to be tested jointly for inclusion or exclusion from models. An element of {it:group_list} is a {it:varlist}, and elements are separated by commas, for example {cmd:group(x1 i.x2, y1 y2)}. Such groups of variables (or, in the case of categorical variables prefixed with {cmd:i.}, their implied dummy variables) are surrounded by parentheses when presented to {cmd:stepwise} for analysis. {phang} {opt lock(varlist)} specifies variables to be kept in all models. Such variables are surrounded by parentheses when presented to {cmd:stepwise} for analysis. The {opt lockterm1} option of {cmd:stepwise} is applied to them. {phang} {opt pe(#)} specifies the significance level for addition to the model; terms with p < {opt pe()} are eligible for addition. {phang} {opt pr(#)} specifies the significance level for removal from the model; terms with p >= {opt pr()} are eligible for removal. {phang} {opt show} displays the output from {cmd:stepwise} for each regression analysis to develop the prediction equations used by {cmd:ice}. {dlgtab:ice (less used options)} {phang} {opt allmissing} imputes missing values in observations in which all variables in {it:mainvarlist} are missing. The default is to leave such values as missing. {phang} {cmd:boot}[{cmd:(}{it:varlist}{cmd:)}] instructs that each member of {it:varlist}, a subset of {it:mainvarlist}, be imputed with the {cmd:boot} option of {cmd:uvis} activated. If {cmd:(}{it:varlist}{cmd:)} is omitted then all members of {it:mainvarlist} with missing observations are imputed using the {opt boot} option of {opt uvis}. {phang} {opt by(varlist)} performs multiple imputation separately for all combinations of variables in {it:varlist}. Observations with missing values for any members of {it:varlist} are excluded. May be combined with {opt restrict()}. {phang} {opt cc(varlist)} prevents imputation of missing data in {it:mainvarlist} for cases in which any member of {it:varlist} has a missing value. "cc" signifies "complete case". Note that members of {it:varlist} are used for imputation if they appear in {it:mainvarlist}, but not otherwise. Use of this option is equivalent to entering {cmd:if} {cmd:~missing(}{it:var1}{cmd:) &} {cmd:~missing(}{it:var2}{cmd:)} ..., where {it:var1}, {it:var2}, ... denote the members of {it:varlist}. {phang} {marker cmd}{opt cmd(cmdlist)} defines the regression commands to be used for each variable in {it:mainvarlist}, when it becomes the dependent variable in the switching regression procedure used by {cmd:uvis} (see {help ice##algorithm:Algorithm used by uvis}). The first item in {it:cmdlist} may be a command such as {cmd:regress} or may have the syntax {it:varlist}{cmd::}{it:cmd}, specifying that command {it:cmd} applies to all the variables in {it:varlist}. Subsequent items in {it:cmdlist} must follow the latter syntax, and each item should be followed by a comma. {pin} The default {it:cmd} for a variable is {cmd:logit} when there are two distinct values, {cmd:mlogit} when there ar 3-5 and {cmd:regress} otherwise. {phang2} Example: {cmd:cmd(regress)} specifies that all variables are to be imputed by {cmd:regress}, over-riding the defaults {phang2} Example: {cmd:cmd(x1 x2:logit, x3:regress)} specifies that {cmd:x1} and {cmd:x2} are to be imputed by {cmd:logit}, {cmd:x3} by {cmd:regress} and all others by their default choices {pin} {it:Advanced use}: If a {it:cmd} is implicitly defined for a variable by a {cmd:o.} or {cmd:m.} prefix and the {cmd:cmd()} option is used explicitly for that same variable then the explicit use takes precedence over the implicit use. For example, the combination ... {cmd:o.x1, cmd(x1:regress)} would impute {cmd:x1} with {cmd:regress} rather than with the implicit {cmd:ologit}. Used with {cmd: match(x1)}, this would give a reasonable alternative to ordinal logistic regression for imputing an ordered categorical variable {cmd:x1}. {phang} {opt conditional(condlist)} invokes conditional imputation. Each item of {it:condlist} has the form {it:varlist}{cmd::} {it:condition}. Items are separated by backslash ({hi:\}). The idea is that members of {it:varlist} are only informative when {it:condition} is true, and that they take some {it:pre-determined value} when {it:condition} is false. {pmore} Important: This option was not correctly implemented in versions of {cmd:ice_} before 1.2.2 – use {stata which ice_} to check your version. {pmore} Conditional imputation requires that (i) when any variable included in {it:condition} is missing, all variables in {it:varlist} are missing, and (ii) when {it:condition} is false, each variable in {it:varlist} takes only one value (the {it:pre-determined value}, which might be 0 or a unique "not-applicable" code such as 99). {pmore} In detail, members of {it:varlist} are imputed in the usual way for the subset of observations for which {cmd:if} {it:condition} is true (i.e. {it:condition} evaluates to a non-zero quantity). For the subset of observations for which {cmd:if} {it:condition} is false, the {it:pre-determined value} is identified from the data for each member of {it:varlist} and is used to impute any missing values for that variable. An example is given below. {pmore} {it:condition} is a Stata expression constructed so that {cmd:if} {it:condition} can be evaluated for the current dataset. Variables appearing in {it:condition} may be members of {it:mainvarlist} or merely variables in the dataset. The only other situation in {cmd:ice} in which variables that do not appear in {it:mainvarlist} may be used is described under the {opt passive()} option. {pin} Consider a simple example, a dataset comprising three incomplete variables {hi:age}, {hi:female}, and {hi:pregnant}, where {hi:female} is 1 for females, 0 for males, and {hi:pregnant} is 1 for pregnant, 0 for not pregnant. Since males can't be pregnant, we wish to impute missing values of {hi:pregnant} using only data from females. If we impute someone with missing gender as male, we want their pregnancy status always to be imputed as non-pregnant. If males are simply coded as non-pregnant then the {it:pre-determined value} is the value of {hi:pregnant} denoting non-pregnant, i.e. 0; if instead males are coded as pregnant=99 then the {it:pre-determined value} is 99. In either case, we implement the conditional imputation as follows: {phang2}{cmd:. ice age pregnant female, conditional(pregnant: female==1) clear}{p_end} {pin} Here, the prediction equation for {hi:age} is {hi:pregnant female}, that for female is {hi:age} and that for {hi:pregnant} is {hi:age if female==1}. Observations of {hi:pregnant} for originally missing observations of {hi:female} now imputed as male (i.e. {hi:female} = 0) are assigned the value 0 by {cmd:ice}. {pin} We can have dependent conditional imputation. For example, suppose a fertility test {hi:fertile}, taking the value 1 for fertile and 0 for infertile, was available just for females. We might code this as follows: {phang2}{cmd:. ice age pregnant female fertile, conditional(pregnant: female==1 & fertile==1 \ fertile: female==1) clear} {pin} which reflects that only fertile females can become pregnant, and only females have a fertility test. {phang} {opt cycles(#)} determines the number of cycles of regression switching to be carried out. Default {it:#} is 10. {phang} {opt debug} provides assistance for debugging individual regressions. As {cmd:ice} runs, it prints out, for each imputation and cycle, the name of the regression command, the variable being imputed and R2, the explained variation of the model (Nagelkerke method). At the same time, the values from the last cycle only are stored in a new file called {cmd:_ice_debug.dta}, in the current working directory. A plot of R2 against cycle number may indicate abnormalities; for example if R2 shows instability, the corresponding model may have some features that need improving. The option is useful also for detecting regression models that explain a negligible amount of variation; such models are candidates for deletion. {pin} Because only the final cycle is stored, for debugging purposes it may be most sensible to use the {opt debug} option with, say, {cmd:cycles(100)} and {cmd:m(1)}. {phang} {opt dropmissing} is a feature designed to save memory when using the file of imputed data created by {cmd:ice}. It omits from {it:filename} all observations which are not in the estimation sample, that is for which either (i) they are filtered out by {cmd:if} or {cmd:in}, or a non-positive weight, or (ii) the values of all variables in {it:mainvarlist} are missing. This option provides a "clean" analysis file of imputations, with no missing values. Note that the observations not in the estimation sample are omitted also from the original data, stored as imputation #0 in {it:filename}. {phang} {opt eqdrop(eqdroplist)} deletes variables from prediction equations. The syntax of {it:eqdroplist} is {it:varname1}{cmd::}{it:varlist1} [{cmd:,}{it:varname2}{cmd::}{it:varlist2} ...] where each {it:varname#} (or {it:varlist#}) is a member (or subset) of {it:mainvarlist}. One can only remove predictors from equations for variables with missing values (although trying to remove predictors from non-existent equations is not a fatal error - an information message is issued). Variable names prefixed by {cmd:i.} are allowed, provided that the names were prefixed by {hi:i.}, {hi:m.} or {hi:o.} in {it:mainvarlist}. They are translated to the corresponding dummy variables created by {cmd:xi:}. {phang} {opt genmiss(string)} creates an indicator variable for the missingness of data in any variable in {it:mainvarlist} for which at least one value has been imputed. The indicator variable is set to missing for observations excluded by {cmd:if}, {cmd:in}, etc. The indicator variable for {it:xvar} is named {it:string}{it:xvar}. The information on missingness is implicit in the original data, which is stored as "imputation 0". {phang} {opt id(newvarname)} creates a variable called {it:newvarname} containing the original sort order of the data. Default {it:newvarname}: {cmd:_mi}. {phang} {opt interval(intlist)} imputes interval-censored variables. An interval-censored value is one which is known to lie in an interval [a,b] where a and b are finite and a <= b, or in (-infinity,b] or in [a,infinity). When either terminal is infinite we have left or right censoring, respectively. {it:intlist} has the syntax {it:varname}{hi::}{it:llvar ulvar} [{hi:,} {it:varname}:{it:llvar ulvar} ...], where each {it:varname} is an interval-censored variable, each {it:llvar} contains the lower bound (a) for {it:varname} and each {it:ulvar} contains the upper bound (b) for {it:varname} (or a missing value to represent plus or minus infinity). The supplied values of {it:varname} are irrelevant since they will be replaced anyway; it is only required that {it:varname} exist. Observations with {it:llvar} missing and {it:ulvar} present are left-censored for {it:varname}. Observations with {it:llvar} present and {it:ulvar} missing are right-censored for {it:varname}. Observations with {it:llvar} = {it:ulvar} are complete, and no imputation is done for them. Observations with both {it:llvar} and {it:ulvar} missing are imputed assuming an uncensored normal distribution. See {help ice##interval:Interval censoring} for further information. {phang} {opt initialonly} imputes by random sampling from the distribution of the non-missing values of each variable which has missing value(s). This is the initialisation step of the MICE algorithm (see Remarks). This option may be used to get a 'quick and dirty' set of multiple imputations with which to explore initial impressions of the analysis model, or to investigate possible prediction equations for subsequent multiple imputation using the MICE method. The prediction equations that are displayed are the ones that would be used by default in a full MICE imputation run; with the {opt initialonly} option, they are ignored when imputations are produced. {phang} {marker matchpool}{opt matchpool(#)} modifies the implementation of the {cmd:match()} and {cmd:lrd} options. {cmd:match} performs predictive mean matching in which a pool of potential matches is constructed and one member of this pool is sampled (with equal probabilities). {it:#} specifies the size of this pool. The default is 10. Please note that older versions of {cmd:ice} used {it:#} = 1 and later 3. Users are cautioned against using {it:#} = 1. {phang} {opt monotone} assumes the members of {it:mainvarlist} have a monotone missingness pattern, that is, {cmd:ice} defines the prediction equations appropriately. For variables x1, ..., xk the imputation equations would be x1 on [nothing], x2 on x1, x3 on x1 x2, ... , xk on x1 x2 ... x(k-1). When the missingness really is monotonic, only one cycle of MICE is required, so the default here is {cmd:cycles(1)}. There is no advantage in specifying more than one cycle. {pmore} With the {opt monotone} option, {cmd:ice} reports a 'non-monotonicity score'. This is defined as 100 * (sum of numerators) / (sum of denominators), where the sums are taken over all adjacent pairs of variables in {it:mainvarlist}. Consider two variables, x1 and x2. The numerator for x1 and x2, i.e the non-monotonicity, is the number of observations in the estimation sample for which x1 is missing and x2 is observed. If the numerator is positive, x1 and x2 show a non-monotonic pattern. The denominator for x1 and x2 is the the number of observations in the estimation sample for which x2 is observed. {pmore} {cmd:ice} takes a relaxed view of runs in which the non-monotonicity score is positive. It warns the user but goes ahead with the imputation anyway - it assumes that the user knows what they are doing. {phang} {opt noshoweq} suppresses the presentation of the prediction equations. {phang} {opt noconstant} suppresses the regression constant in all regressions. {phang} {opt nopp} suppresses treatment of the perfect prediction bug (see {help ice##pp:Avoiding the perfect prediction bug}). {phang} {opt noverbose} suppresses display of the imputation number (as {it:#}) and cycle number within imputations (as {cmd:.}) which show the progress of the imputations. {phang} {opt nowarning} suppresses warning messages. {phang} {opt on(varlist)} changes the operation of {cmd:ice} in a major way. With this option, {cmd:uvis} imputes each member of {it:mainvarlist} univariately on {it:varlist}. This provides a convenient way of producing multiple imputations when imputation for each variable in {it:mainvarlist} is to be done univariately on a set of complete predictors. {phang} {opt orderasis} enters the variables in {it:mainvarlist} into the MICE algorithm in the order given. The default is to order them according to the number of missing values: the variable with least missingness gets imputed first, and so on. {phang} {opt persist} causes {cmd:ice} to ignore errors raised by {cmd:uvis} when trying to impute a "difficult" variable, or impute with a model that is difficult to fit to the data to hand. Trying to impute a "difficult" variable using the {cmd:ologit} or {cmd:mlogit} command is the most common cause of failure. By default, {cmd:ice} stops with an error message. With {opt persist}, {cmd:ice} continues to the next variable to be imputed, not updating the variable that raised an error. Often, by the play of chance, the "difficult" variable is successfully updated in a subsequent cycle, and no damage is done to the imputation process. {pin} If the error for a given variable appears in every cycle, you should consider changing the prediction equation for that variable, since its imputed values are unlikely to be appropriate. {pin} We do not recommend the routine use of {opt persist}. Only use it when it appears that there is sporadic failure to fit an imputation model. {phang} {cmd:restrict(}[{varname}] [{it:{help if}}]{cmd:)} specifies that imputation models be computed using the subsample identified by {it:varname} and {it:if}. {pmore} The subsample is defined by the observations for which {it:varname}!=0 that also meet the {it:if} conditions. Typically, {it:varname}=1 defines the subsample and {it:varname}=0 indicates observations not belonging to the subsample. For observations whose subsample status is uncertain, {it:varname} should be set to a missing value; such observations are dropped from the subsample. {pmore} By default {cmd:ice} fits imputation models and imputes missing values using the sample of observations identified in the {ifin} options. The {opt restrict()} option identifies a subset of this sample to be used for model estimation. Imputation is restricted to the sample identified in the {ifin} options. Thus, predictions and their associated imputations are made 'out-of-sample' with respect to the subsample defined by {opt restrict()}. {pmore} Be careful to avoid restrictions that prevent prediction for all the relevant observations. For example, models that involve {cmd:mlogit} will fail to predict 'everywhere' if the {opt restrict()} option excludes any of the levels of the target variable, as in the following example. {cmd:school} is a four-level categorical variable coded 0, 1, 2, 3: {phang2} {cmd:. gen byte ok = (school > 0) if !missing(school)}{p_end} {phang2} {cmd:. ice school house age sex bcg, clear restrict(ok)} {pmore} By default, {cmd:school} is imputed using {cmd:mlogit}. Predictions cannot be made for observations with {cmd:school==0}. {cmd:ice} will halt with error #303 (equation not found). {phang} {opt seed(#)} sets the random number seed to {it:#}. In order to reproduce a set of imputations, the same random number seed should be used. See {help ice##reproducibility:Reproducibility of results from uvis and ice} for further comments. Default {it:#}: 0, meaning no seed is set by the program; depending on the status of Stata's random number seed, different sets of imputations should be obtained on each run. {marker substitute}{...} {phang} {opt substitute(sublist)} is typically used with the {cmd:passive()} option to represent multilevel categorical variables as dummy variables in models for predicting other variables. See {cmd:passive()} for more details. The syntax of {it:sublist} is {it:varname}{cmd::}{it:dummyvarlist} [{cmd:,}{it:varname}{cmd::}{it:dummyvarlist} ...] where {it:varname} is the name of a variable to be substituted and {it:dummyvarlist} is the list of dummy variables representing it. {pin} Note, however, the following important convenience feature: {cmd:substitute()} may be used without corresponding expressions in {cmd:passive()} to recreate dummy variables automatically. If the values of variables in {it:dummyvarlist} are NOT defined through expressions involving {it:varname} in the {cmd:passive()} option, then the variables in {it:dummyvarlist} are calculated according to the actual range of values of {it:varname}. For example, suppose the options {cmd:passive(x1a:x1==2 \ x1b:x1==3)} and {cmd:substitute(x1:x1a x1b)} were specified. Provided that all the non-missing values of {cmd:x1} were 2 when {cmd:x1a}==1 and all the non-missing values of {cmd:x1} were 3 when {cmd:x1b}==1, then {cmd:passive(x1a:x1==2 \ x1b:x1==3)} is implied by {cmd:substitute(x1:x1a x1b)} and can be omitted. The rule applied by {cmd:substitute(x:dummy1 [dummy2...])} for defining dummy variables dummy1, dummy2, ... is as follows: {phang2} 1. Determine the range of values [xmin, xmax] of x for which dummy1 > 0. {phang2} 2a. If xmin < xmax, define dummy1 to be 1 if xmin <= x <= xmax and 0 otherwise. {phang2} 2b. If xmin = xmax, define dummy1 to be 1 if x = xmin and 0 otherwise. {phang2} 3. Repeat steps 1 and 2a,b for dummy2, dummy3, ... as necessary. {pin} With many such categorical variables this feature can save a lot of typing. {phang} {opt trace(trace_filename)} monitors the convergence of the imputation algorithm. For each original variable with missing values, the mean of the imputed values is stored as a variable in {it:trace_filename}, together with the cycle number at which that mean was calculated. The results are stored only for the final imputation. For diagnostic purposes, it is sensible to run {cmd:trace()} with {cmd:m(1)} and a large number of cycles, such as {cmd:cycles(100)}. When the run is complete, it is helpful to load {it:trace_filename} into memory and plot the mean for each imputed variable against the cycle number. If necessary, smoothing may be applied to clarify any apparent pattern. Convergence is judged to have occurred when the pattern of the imputed means is random. It is usually obvious from the appearance of the plot how many cycles are needed for convergence. {dlgtab:uvis} {phang} {opt boot} invokes a bootstrap method for creating imputed values (see {help ice##boot:bootstrap}). {phang} {opt by(varlist)} performs imputation separately for all combinations of variables in {it:varlist}. Observations with missing values for any members of {it:varlist} are excluded. May be combined with {opt restrict()}. {phang} {opt gen(newvar)} is not optional. {it:newvar} contains original (non-missing) and imputed (originally missing) values of {it:yvar}. {phang} {opt lrd} creates imputations by local residual draws. This method is related to predictive mean matching, but the {it:residual} is borrowed from one of the closest non-missing observations, rather than the observed value. {phang} {opt match} creates imputations by predictive mean matching. The default is to draw imputations at random from the posterior distribution of the missing values of {it:yvar}, conditional on the observed values and the members of {it:xvars}. See {help ice##match:match} for further details. {phang} {opt matchpool(#)} - see {help ice##matchpool:matchpool} for details. {phang} {opt matchtype(#)} defines how the uncertainty is represented in choosing the closest matches for the {it:match} and {it:lrd} methods. Type 1 matches the predictive mean for observed values to a {it:draw} of the predictive mean for missing values. Type 2 uses a draw of the prediction for observed and missing values. Type 3 uses a different draw for observed and missing values. Type 1 is recommended. {phang} {opt noconstant} suppresses the regression constant in all regressions. {phang} {opt noverbose} suppresses non-error messages while {cmd:uvis} is running. {phang} {opt replace} permits {it:newvar} (see {cmd:gen(}{it:newvar}{cmd:)}) to be overwritten with new data. {cmd:replace} may not be abbreviated. {phang} {cmd:restrict(}[{varname}] [{it:{help if}}]{cmd:)} specifies that the imputation model be computed using the subsample identified by {it:varname} and {it:if}. {pmore} The subsample is defined by the observations for which {it:varname}!=0 that also meet the {it:if} conditions. Typically, {it:varname}=1 defines the subsample and {it:varname}=0 indicates observations not belonging to the subsample. For observations whose subsample status is uncertain, {it:varname} should be set to a missing value; such observations are dropped from the subsample. {pmore} By default {cmd:uvis} fits the imputation model using the sample of observations identified in the {ifin} options. The {opt restrict()} option identifies a subset of this sample. {phang} {opt seed(#)} sets the random number seed to {it:#}. See {help ice##reproducibility:Reproducibility of results from uvis and ice} for comments on how to ensure reproducible imputations by using the {cmd:seed()} option. Default {it:#}: 0, meaning no seed is set by the program. {title:Remarks} {marker algorithm}{...} {pstd} {hi:{ul:Algorithm used by uvis}} {pstd} When {it:cmd} is {cmd:regress}, {cmd:uvis} imputes {it:yvar} from {it:xvars} according to the following algorithm (see van Buuren et al (1999) section 3.2 for further technical details): {phang2} 1. Estimate the vector of coefficients (beta) and the residual variance by regressing the non-missing values of {it:yvar} on the current "completed" version of {it:xvars}. Predict the fitted values {it:etaobs} at the non-missing observations of {it:yvar}. {phang2} 2. Draw at random a value (sigma_star) from the posterior distribution of the residual standard deviation. {phang2} 3. Draw at random a value (beta_star) from the posterior distribution of beta, conditional on sigma_star, thus allowing for uncertainty in beta. {phang2} 4. Use beta_star to predict the fitted values {it:etamis} at the missing observations of {it:yvar}. {phang2} 5. The imputed values are predicted directly from beta_star, sigma_star and the covariates. For imputation by linear regression, this step assumes that {it:yvar} is Normally distributed, given the covariates. For other types of imputation, samples are drawn from the appropriate distribution. {marker match}{...} {pstd} With the {cmd:match} option, step 5 is replaced by the following. For each missing observation of {it:yvar} with prediction {it:etamis}, find the {it:k} non-missing observations (where {it:k} is the number in {it:matchpool}(#)) of {it:yvar} whose prediction ({it:etaobs}) on observed data is closest to {it:etamis}. One of the closest non-missing observations {it:yobs} is selected at random and used to impute the missing value of {it:yvar}. {pstd} With the {cmd:lrd} option, the closest matches are selected using match. Again, one of the {it:k} closest non-missing observations is selected at random. The imputed value for a missing observation is {it:etamis} + ({it:yobs - etaobs}). {pstd} The default draw method is not robust to departures from Normality and may produce implausible imputations. For example, if the original distribution is skew and positive-valued, the imputed distribution will not necessarily have the appropriate amount of skewness, nor will all the imputed values necessarily be positive. Log transformation of positive variables may greatly improve the appropriateness of the imputations. {pstd} The alternative {cmd:match} method is recommended only for continuous variables when the Normality assumption is clearly untenable, even approximately. It is not necessary, nor is it implemented, for binary, ordered categorical or nominal variables. {cmd:match} may work well when the distribution of a continuous variable is very non-Normal, but it may sometimes result in biased imputations. {marker boot}{...} {pstd} With the {cmd:boot} option, steps 2-4 are replaced by a bootstrap estimation of beta_star and sigma_star, obtained by regressing {it:yvar} on {it:xvars} after taking a bootstrap sample of the non-missing observations. This has the advantage of robustness since the distribution of beta is no longer assumed to be multivariate normal. {pstd} Note that {cmd:uvis} will not impute observations for which a value of a variable in {it:xvars} is missing. However, all original (missing or non-missing) observations of {it:yvar} will be copied into {it:newvarname} in such cases. This is a change from the first release of {cmd:uvis} (with {cmd:mvis}). Previously, {it:newvarname} would be set to missing whenever a value of a variable in {it:xvars} was missing, irrespective of the value of {it:yvar}. {pstd} Missing data for ordered (or unordered) categorical covariates should be imputed by using the {cmd:ologit} (or {cmd:mlogit}) command. {cmd:match} is neither required nor implemented in these cases. {pstd} {cmd:ice} carries out multivariate imputation in {it:mainvarlist} using regression switching (van Buuren et al 1999) as follows: {phang2} 1. Ignore any observations for which {it:mainvarlist} has only missing values, or if the {cmd:cc(}{it:varlist}{cmd:)} option has been specified, for which any member of {it:varlist} has a missing value. {phang2} 2. For each variable in {it:mainvarlist} with any missing data, randomly order that variable and replicate the observed values across the missing cases. This step initialises the iterative procedure by ensuing that no relevant values are missing. {phang2} 3. For each variable in {it:mainvarlist} in turn, impute missing values by applying {cmd:uvis} with the remaining variables as covariates. {phang2} 4. Repeat step 3 {cmd:cycles()} times, replacing the imputed values with updated values at the end of each cycle. {pstd} A single imputation sample is created for each variable with any relevant missing values. {pstd} Van Buuren recommends {cmd:cycles(20)} but goes on to say that 10 or even 5 iterations are probably sufficient. We have chosen a compromise default of 10. {pstd} "Multiple imputation" (MI) implies the creation and analysis of several imputed datasets. To do this, one would run {cmd:ice} with {it:m} set to a suitable number, for example 5. To obtain final estimates of the parameters of interest and their standard errors, one would fit a model in each imputation and carry out the appropriate post-MI averaging procedure on the results from the {it:m} separate imputations. A suitable estimation tool for this purpose is {help mim}. {pstd} {hi:{ul:Handling the outcome variable}} {pstd} To avoid bias, the outcome variable must always be included in the list of variables to be used for imputation. In survival analysis, in particular, it is essential to include the censoring indicator as well as the survival time. van Buuren et al (1999) recommend a log transformation of the survival time, apparently a heuristic choice. We have shown (White & Royston 2008) that for a single binary predictor and a proportional hazards analysis model, the correct imputation model comprises the baseline cumulative hazard, the censoring indicator and the binary predictor. The theory remains approximately valid for a normally distributed predictor with a weak effect. More complex cases have not yet been investigated, but at least some guidance is now available. {pstd} {hi:{ul:Handling binary variables}} {pstd} Binary variables present no difficulty. By default, in the MICE procedure, when such a variable is the response, it is predicted from other variables by using logistic regression; when it is a covariate, it is modelled in the only way possible, effectively as a single dummy variable. {pstd} Ensure that binary variables are coded 0/1. Although, in theory, one could use {cmd:ologit} or {cmd:mlogit} to model them, in practice there is no advantage in doing so. Furthermore, do not use the {hi:i.} prefix with binary variables, since there is a speed penalty in doing so. {pstd} {hi:{ul:Handling categorical variables}} {pstd} Categorical variables with 3 or more levels may in principle be treated in different ways. By default, in {cmd:ice} variables with 3-5 levels are modelled using multinomial logistic regression ({cmd:mlogit} command) when the response, and as a single linear term when a covariate. The same behaviour occurs with the ordered logistic model ({cmd:ologit} command). Our recommended strategy is to use the {hi:m.} or {hi:o.} prefixes for variables to be imputed using unordered or ordered logistic regression. This approach removes the need to define the {opt substitute()} and {opt passive()} options, both of which can be tedious and error-prone to type. {pstd} You should be aware that unless the dataset is large, use of the {cmd:mlogit} command may produce unstable estimates if the number of levels is too large, and may compromise the accuracy of the imputations. It is hard to predict when this will occur. {marker interval}{...} {pstd} {hi:{ul:Interval censoring}} {pstd} Values of a variable y that are interval censored are imputed under the assumption that y is normally distributed with unknown mean and variance. The method, which is fast and efficient, is essentially as described for right-censored variables in section 3.3 of Royston (2001). A minor extension to allow left or interval censoring is employed. For example, if A < y < B and A and B are both finite, the imputed value for y will follow a truncated normal distribution with bounds A and B, variance parameter estimated from the data and mean given by the linear predictor for the imputation model for y. Stata's {cmd:intreg} command is used to estimate the mean and variance of y. When A and B are both missing (infinite), imputation of y simply assumes the normal distribution just mentioned, but without bounds. {pstd} If you wish to impose range limits on the imputed values, the lower and upper bound variables may be set accordingly. For example, to impute right-censored (e.g. survival) data, you would set {it:llvar} equal to all the observed times to event, whether censored or not, and {it:ulvar} to all the uncensored event times and missing for the censored times. This would cause the right-censored values to be imputed without restriction. If you wanted to bound the imputed values above, say by 10, you would specify {it:ulvar} to be 10 (rather than missing) for all the censored observations. {marker pp}{...} {pstd} {hi:{ul:Avoiding the perfect prediction bug}} {pstd} Perfect prediction may arise in {cmd:logistic}, {cmd:ologit} or {cmd:mlogit} regression models when a (usually categorical) predictor variable perfectly predicts success or failure in the outcome variable. In {cmd:ice}, perfect prediction may occur without the user's knowledge because a large number of regression models are run silently. Perfect prediction may lead to entirely inappropriate imputations. To avoid this, {cmd:uvis} checks for perfect prediction; if it is detected, {cmd:uvis} temporarily augments the data with a small number of extra observations with low weight, in such a way as to remove the perfect prediction. A message is displayed noting the variable that has the perfect prediction issue, and that the problem has been dealt with. Such treatment of the perfect prediction bug may be switched off, if desired, by using the {opt nopp} option. {pstd} {hi:{ul:Errors and diagnostics}} {pstd} {cmd:ice} may occasionally detect an anomaly when running {cmd:uvis} with a particular variable as response and a particular regression command. {cmd:ice} will then stop and report the {cmd:uvis} command it was running and the error number returned. Also, {cmd:ice} saves to a file called {hi:_ice_dump.dta} in the working directory a snapshot of the data it was using when the error occurred, while also reporting the {cmd:uvis} command it was executing. Sometimes the problem lies in a regression of a binary or categorical variable where the estimation procedure fails to converge; this is usually caused by sparse cell occupancy of the response variable. If you obtain this error you should either omit the offending variable from the imputation, or seek to combine a sparse category with another category. {pstd} Another possibility is that, again due to a defect in a particular regression command in the chained equations structure, the number of values imputed for a particular variable is less than expected. This is a serious error and again may arise from estimation problems involving a binary or categorical variable. In this situation, {cmd:ice} again saves to a file called {hi:_ice_dump.dta} in the working directory a snapshot of the data it was using in the attempted estimation, while reporting the {cmd:uvis} command it was executing. You can then investigate what may have gone wrong with the command by loading the data in {hi:_ice_dump.dta} and re-running the offending regression command. {marker reproducibility}{...} {pstd} {hi:{ul:Reproducibility of results from {cmd:uvis} and {cmd:ice}}} {pstd} Use of the option {opt seed(#)} ensures that a set of imputed values is reproduced identically for a given value of {it:#}. This is true for both {cmd:uvis} and {cmd:ice}. {pstd} Please report to the author any instances where use of {cmd:ice} or {cmd:uvis} with a fixed seed does not produce the same set of imputed values. {marker pitfalls}{...} {pstd} {hi:{ul:Pitfalls in using the i. prefix}} {pstd} {cmd:ice} commands that include {hi:i.}{it:varname} in {it:mainvarlist} need to be handled with awareness. If {it:varname} has no missing data in the estimation sample, expected results are obtained. If {it:varname} does have missing values in the estimation sample, an error message is given and {cmd:ice} stops. The "estimation sample" here is the set of observations for which at least one variable in {it:mainvarlist} has non-missing value(s). {pstd} The presence of {hi:i.} evokes {cmd:xi}, which expands {hi:i.}{it:varname} in the usual way to create {hi:_I}{it:varname}{hi:_}{it:#} dummy variables. Since {it:varname} has no missing data, the dummy variables are included in the prediction equations for other variables in {it:mainvarlist}, as required. {pstd} If {hi:i.}{it:varname} were allowed to have missing data in the estimation sample, {cmd:xi} expansion would occur as before, but each of the {hi:_I}{it:varname}{hi:_}{it:#} dummy variables would become a response variable in a prediction equation and would be predicted individually (using logistic regression). Worse, the prediction equation for each dummy variable would include the {it:other} dummy variables from {cmd:i.}{it:varname}. That is clearly nonsense. {pstd} The advice, as always, is (a) to use {cmd:dryrun} before 'production' runs if the {cmd:ice} command is at all complex, and then (b) carefully to check that {cmd:ice}'s table of prediction equations is both sensible and what you expected. {pstd} {hi:{ul:Further notes}} {pstd} {opt ice} saves all the variables in the current data to the output, whether or not they are involved in the imputation procedure. This can make the resulting dataset very large. It may therefore be sensible to drop variables not subsequently needed for modelling before running {opt ice}. {pstd} {cmd:ice} determines the order of imputing variables in the cycle of chained equations according to the amount of missing data. Variables with the least missingness are imputed first. Variables with the same amount of missingness are processed in an arbitrary order, but always in the same order. Note that if {opt ice} is run twice using identical variables (at least two of which have the same amount of missingness) and the same random number seed, but with the variables with equal missingness in a different order, slightly different imputations will be generated. The differences will be purely random and will not produce bias in subsequent parameter estimates. If the {opt boot()} option is applied to all variables, the order of variables no longer affects the results. {pstd} An important application of MI is to investigate possible models, for example prognostic models, in which selection of influential variables is required (Clark & Altman 2003). For example, the stability of the final model across the imputation samples is of interest. This area of enquiry is in its infancy. {pstd} See also Van Buuren's website http://www.multiple-imputation.com for further information and software sources. {title:Examples} {phang} {cmd:. uvis regress y x1 x2 x3, gen(ym)} {phang} {cmd:. uvis logit y x1 x2 x3, gen(y) by(x4) restrict(x5) replace noverbose} {phang} {cmd:. uvis intreg ll ul x1 x2 x3, gen(y)} {phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5)} {phang} {cmd:. ice x1 x2 x3, dropmissing monotone clear m(5)} {phang} {cmd:. ice x1 x2 i.x3, clear m(5)}{p_end} {phang} [Note that x3 must have no missing values in the estimation sample] {phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5) cycles(20) cc(x4 x5)} {phang} {cmd:. ice m.x1 m.x2 o.x3 x4 x5, saving(imputed) m(10) boot(x1 x2 x3) match(x4 x5) id(pid) seed(101) genmiss(M_)} {phang} {cmd:. gen x23 = x2 * x3}{p_end} {phang} {cmd:. ice o.x1 x2 x3 x23 z1 z2, saving(imputed) m(5) passive(x23:x2*x3) conditional(z1: if z2==0)} {phang} {cmd:. ice y1 y2 y3 x1 x2 x3 x4, saving(imputed) m(5) eq(y1:x1 x2 y2, y2:y1 x3 x4, y3:y1 y2) match(y3)} {phang} {cmd:. ice y1 y2 y3 x1 x2 o.x3 i.x4, saving(imputed) m(5) stepwise swopts(pe(.10) pr(.15) group(x1 x2, y1 i.x4)lock(y2 x3)) match(x3)} {phang} {cmd:. ice x1-x99, clear debug m(1) cycles(100)} {phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) eqdrop(x2:x3, x1:x2)} {phang} {cmd:. ice x1 x2 x3, saving(imputed) m(5) cmd(x1:ologit) match(x2) dropmissing} {phang} {cmd:. ice x1 ll2 ul2 x2 ll3 ul3 x3, saving(imputed) m(5) interval(x2:ll2 ul2, x3:ll3 ul3)} {title:Author} {pstd} Patrick Royston, MRC Clinical Trials Unit, London.{break} pr@ctu.mrc.ac.uk {title:Further reading} {phang} van Buuren S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. {it:Statistics in Medicine} {cmd:18}:681-694. Also see http://www.multiple-imputation.com. {phang} Carlin J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. {it:Stata Journal} {cmd:3(3)}:226-244. {phang} Clark T. G. and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: an ovarian cancer case-study. {it:Journal of Clinical Epidemiology} {cmd:56}28-37. {phang} Royston P. 2001. The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. {it:Statistica Neelandica} {cmd:55}:89-104. {phang} Royston P. 2004. Multiple imputation of missing values. {it:Stata Journal} {cmd:4(3)}:227-241. {phang} Royston P. 2005a. Multiple imputation of missing values: update. Stata Journal {cmd:5}: 188-201. {phang} Royston P. 2005b. Multiple imputation of missing values: update of {cmd:ice}. Stata Journal {cmd:5}: 527-536. {phang} Royston P. 2007. Multiple imputation of missing values: further update of ice, with an emphasis on interval censoring. Stata Journal {cmd:7}: 445-464. {phang} White I. R. and P. Royston. 2009. Imputing missing covariate values for the Cox model. Statistics in Medicine {cmd:28}: 1982-1998. {phang} White I. R., R. Daniel and P. Royston. 2010. Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables. Computational Statistics and Data Analysis {cmd:54}: 2267-2275. {title:Acknowledgements} {pstd} Ian White has made substantial contributions to the understanding and practical use of multiple imputation, and to the programming of {cmd:ice} and {cmd:uvis}. Ian wrote the guts of the {opt draw()} option; the idea and code for coping with perfect prediction are essentially all his. I am extremely grateful to him for his ongoing commitment to this project. {pstd} I am grateful also to Gillian Raab for pointing out certain issues with the prediction matching approach, particularly that it is only useful with continuous variables. As a result, the default imputation method has been changed from matching to drawing from the predictive distribution. Gillian also suggested imputing the variables in reverse order of the amount of missingness, and selecting the imputed value at random from the set determined by the available matching predictions. Both suggestions have been implemented. {title:Also see} {psee} On-line: help for {help mim} (if installed), {help mi ice} (if installed, Stata 11 only).