help for mim                         P Royston, JC Galati, JB Carlin & IR White
-------------------------------------------------------------------------------

Title

mim -- A prefix command for analysing and manipulating multiply imputed datasets

Syntax

mim [, mim_options] : command

mim [, replay_options]

mim_options              Description
-------------------------------------------------------------------------
General
* category(cat_type)     where cat_type is fit, manip or combine - specify
                         whether command is estimation, data manipulation or
                         one whose (scalar) results are to be combined using
                         Rubin's rules
  noisily                display output from execution of command within
                         each of the imputed datasets

Estimation (valid only for estimation commands)
  dots                   display progress dots during model fitting
  from(#)                fit model, starting from imputation #
  to(#)                  fit model, ending with imputation #
  storebv                fills e(b), e(V) etc. with multiple-imputation
                         estimates

Manipulation (valid only for data manipulation commands)
+ sortorder(varlist)     one or more variables that uniquely identify the
                         observations in a given imputed dataset following
                         each execution of command

Combination (valid for a wide range of Stata commands)
  est(est_spec)          specifies the scalar (called est) to be combined
                         across imputations
  se(se_spec)            specifies the standard error of est to be combined
                         across imputations
  byvar                  uses byvar (rather than the default, statsby) to
                         extract and store est and its SE in each imputation
-------------------------------------------------------------------------
* only necessary for estimation and data manipulation commands not listed
  under Description
+ not valid for append and reshape; MANDATORY for all other data
  manipulation commands.

replay_options           Description
-------------------------------------------------------------------------
  clearbv                clears e(b), e(V) etc., but leaves other mim
                         estimates intact
  j(#)                   fills e(b), e(V) etc. with estimates corresponding
                         to imputed dataset #
  mcerror                displays a table of Monte Carlo standard errors for
                         quantities in the table of regression coefficients
  storebv                same as for estimation, unless j option is specified
  reporting_options      level and eform options supported by command
-------------------------------------------------------------------------

xi is allowed as a prefix to mim, but not as prefix to command, see xi. svy is allowed as a prefix to command, see svy. version is allowed as a prefix to command, see version.

Description

mim is a prefix command for working with multiply-imputed (MIM) datasets, where command can be any of a wide range of Stata commands. The function that mim performs depends on the category of command passed to mim: estimation, data manipulation, post-estimation or utility. A limited range of commands can be used with mim without specifying the category mim_option. These are:

Estimation: regress, mean, proportion, ratio, logistic, logit, ologit, mlogit, probit, oprobit, poisson, glm, binreg, nbreg, gnbreg, blogit, clogit, cnreg, mvreg, rreg, qreg, iqreg, sqreg, bsqreg, stcox, streg, xtgee, xtreg, xtlogit, xtnbreg, xtpoisson, xtmixed, svy:regress, svy:mean, svy:proportion, svy:ratio, svy:logistic, svy:logit, svy:ologit, svy:mlogit, svy:probit, svy:oprobit, svy:poisson, stepwise

Post Estimation: lincom, testparm, predict

Data Manipulation: reshape, append, merge

Utility: check, genmiss

With one exception, command is specified with its full usual syntax. The exception is merge, where only one "using" file is allowed. Also, command may be one of two internal utility commands, check and genmiss, where the required syntaxes are

mim : check [varlist]

mim : genmiss varname

respectively (see Utility commands for more details regarding these two commands).

Note that the command stepwise expects the syntax of Stata's stepwise command, and is itself a 'prefix' command. It uses P-values from Wald tests to decide whether to include or exclude variables in a model.

Further Stata estimation and data manipulation commands can be used with mim by specifying the mim_option category(cat_type), where cat_type may be fit for estimation commands, manip for data manipulation commands or combine for combining scalar estimates and their SEs according to Rubin's rules. See Combining estimates using Rubin's rules under Examples for more details of mim, category(combine), and the section of the same name under Remarks for a warning about combining estimates in this way. Use of mim in these ways is at the user's discretion, and the results are not guaranteed.

The dataset structure used by mim is a stacked format. In Stata 11 it may be either the new flong style or that created by Royston's ice command (if installed). Details of the dataset format may be found under MIM dataset format below. Also, please study the following remarks on how mim functions under different versions of Stata.

mim and Stata 11

With Stata 11, mim recognizes the 'old' ice-style format variables (_mi and _mj) and the new mi-style variables (_mi_id and _mi_m). Note that multiply imputed data created by ice can be imported into the mi flong style by using the command mi import ice, clear automatic. The automatic option ensures that the imputed variables are correctly registered. If you omit the option, you may encounter difficulties.

If mim is called by a Stata version below 11.0, it recognizes only _mi and _mj as format variables. If called by Stata version 11.0 or higher, mim first looks for _mi and _mj. If it fails to find them, it checks for an mi-style data structure and if necessary converts the data to style flong (see mi set and mi convert). Note that the flong style persists after mim has finished. Finally, if neither type of formatting is found, mim gives up and issues an error message.
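The search order just described can be summarised in a short sketch. The Python below is purely illustrative and is not part of mim: the function name detect_format, its arguments and the returned strings are assumptions made for this sketch, and the is_mi_set flag stands in for whatever test mim applies to recognise an mi-style data structure.

```python
def detect_format(varnames, stata_version, is_mi_set=False):
    """Return the format that would be used, following the order described above.

    varnames      : names of the variables in the dataset
    stata_version : version of the calling Stata, e.g. 10.1 or 11.0
    is_mi_set     : hypothetical flag, True if the data carry an mi-style structure
    """
    names = set(varnames)
    # The ice-style format variables are always looked for first.
    if {"_mi", "_mj"} <= names:
        return "ice"
    # mi-style data are considered only under Stata 11.0 or higher.
    if stata_version >= 11.0 and is_mi_set:
        return "flong"  # converted to style flong if necessary
    # Neither type of formatting was found.
    raise ValueError("no ice-style or mi-style format variables found")
```

For instance, detect_format(["_mi", "_mj", "y"], 10.0) returns "ice", whereas mi-style data presented to a Stata version below 11.0 raise the error.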

In what follows, the format variables are called _mi_id and _mi_m with the implicit understanding that if the data are in the ice format, we mean _mi and _mj, respectively.

With Stata 11, if the data are in mi format and mim creates new variables, e.g. with the mim: predict newvar command, make sure you keep such variables unregistered. To avoid possible data loss in Stata 11 when working with mim, do NOT convert the data to a different mi style using mi convert.

When mim starts, it checks and reports which format is being used.

Options

        +---------+
    ----+ General +----------------------------------------------------------

category specifies the type of command that is being passed to mim: estimation (category fit), data manipulation (category manip), or a command whose scalar results are to be combined using Rubin's rules (category combine).

noisily specifies that the results of the application of command to each of the individual imputed datasets should be displayed.

        +------------+
    ----+ Estimation +-------------------------------------------------------

dots specifies that progress dots should be displayed.

from(#) fits the specified model from imputation # (i.e. for _mi_m >= #). # must be an integer between 1 and m, the maximum value of _mi_m in the dataset. Default # is 1.

storebv specifies that the standard list of returned results for estimation commands be filled using the multiple-imputation results. In particular this forces the multiple-imputation coefficient and covariance matrix estimates into e(b) and e(V), respectively, enabling application at the user's own discretion of Stata post-estimation commands that use these quantities directly (see Replay of estimation results [advanced] for further details).

to(#) fits the specified model between imputation from() and imputation #. # must be an integer between 2 and m, where m is the maximum value of _mi_m in the dataset. Note that if # > m then # is assumed to equal m and no error is raised. Default # is m.
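The way from() and to() select imputations can be sketched as follows. This Python fragment is an illustration only, not mim's code; the function name imputations_to_fit is hypothetical, and it assumes that the imputed datasets are numbered consecutively from 1 to m.

```python
def imputations_to_fit(m, from_=1, to=None):
    """Return the imputation numbers (_mi_m values) used for model fitting.

    m     : largest value of _mi_m in the dataset
    from_ : value of from(#); defaults to 1
    to    : value of to(#); defaults to m, and values above m are treated as m
    """
    if to is None or to > m:
        to = m  # no error is raised when to() exceeds m
    if not 1 <= from_ <= m:
        raise ValueError("from() must be an integer between 1 and m")
    return list(range(from_, to + 1))
```

With m = 5, specifying from(2) to(9) selects imputations 2 to 5, because the value of to() is truncated at m.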

        +--------------+
    ----+ Manipulation +-----------------------------------------------------

sortorder specifies a list of one or more variables that uniquely identify the observations in each of the datasets in a mim-compatible dataset; for data manipulation, this option must specify a list of variables that together uniquely identify the observations in each dataset AFTER command has been applied to the given dataset (note that varlist cannot include _mi_id, since the _mi_m and _mi_id variables are dropped from each dataset prior to the call to command).

        +-------------+
    ----+ Combination +------------------------------------------------------

byvar specifies that byvar be used to execute the required stata_cmd in each imputation and store the required statistic (and optionally, its SE) in new variable(s), to be combined by mim according to Rubin's rules. The default is to use statsby. Use of byvar affects the syntax of the options est() and se(), see below.

est(est_spec) specifies the scalar est to be combined across imputations. est_spec depends on whether the byvar option is used or not. By default, statsby is used to compute est from stata_cmd according to est_spec.

The following table shows what est_spec looks like when the estimand, est, is a regression coefficient, its SE, or a quantity (usually a scalar) returned by stata_cmd in either an e() or an r() result:

---------------------------------------------------------------
Type of estimand (est)         statsby (default)   byvar
---------------------------------------------------------------
Regression coefficient         [eq]_b[varname]     b(varname)
SE of regression coefficient   [eq]_se[varname]    se(varname)
Quantity returned in e()       e(quantityname)     e(quantityname)
Quantity returned in r()       r(quantityname)     r(quantityname)
---------------------------------------------------------------

The optional eq refers to an 'equation'; eq may be an equation name, or # followed by an equation number (for example, #1). byvar does not currently support multiple equations.

se(se_spec) specifies the standard error of est to be used with Rubin's rules. Note that se() is optional; if omitted, only the mean of est across imputations is calculated. se_spec follows the same rules as est_spec (see est() above).
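The arithmetic behind combining a scalar est and its SE can be sketched as follows. This is the standard textbook form of Rubin's rules written in Python for illustration. It is not taken from mim's source, the function name rubin_combine is hypothetical, and the degrees-of-freedom calculation used for confidence intervals is deliberately left out.

```python
import math
from statistics import mean, variance

def rubin_combine(est, se=None):
    """Combine one scalar estimate per imputation.

    est : list of per-imputation estimates (at least two)
    se  : optional list of per-imputation standard errors
    Returns (combined estimate, combined SE or None).
    """
    m = len(est)
    q_bar = mean(est)                    # combined estimate: mean across imputations
    if se is None:
        return q_bar, None               # without se(), only the mean is available
    w = mean([s ** 2 for s in se])       # within-imputation variance
    b = variance(est)                    # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b              # total variance
    return q_bar, math.sqrt(t)
```

Calling rubin_combine(est) without standard errors mirrors the behaviour described for an omitted se() option: only the mean of est is returned.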

        +--------+
    ----+ Replay +-----------------------------------------------------------

clearbv specifies that the additional items returned using the storebv or j options be cleared, but that all other estimation results returned by mim be left intact.

j(#) specifies that the standard results returned by estimation commands be filled using the estimates from the last fit of an estimation command applied to the #th imputed dataset, and that these estimates be replayed.

mcerror displays a table of Monte Carlo standard errors for the quantities presented in the main table of multiple-imputation results. The MC standard errors measure the uncertainty in the estimated quantities due to the use of a finite number m of imputations. In general, MC error decreases as m is increased. The MC error for the regression coefficients is computed as the square root of the between-imputation variance (B) divided by the square root of the number of imputations, that is, sqrt(B/m). For the other quantities, jackknife estimates (leaving out one imputation each time) (Efron & Gong 1983) are presented. The mcerror option may not be combined with replay options other than reporting_options, nor may it be specified at model-fitting time.

storebv is the same as for estimation, unless the j option is specified.

reporting_options specifies level() and eform options supported by command.
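As an illustration of the mcerror formula for regression coefficients, sqrt(B)/sqrt(m), the following Python sketch computes the Monte Carlo error from the per-imputation estimates of a single coefficient. It is not mim's code; the function name mc_error_coef is hypothetical, and B is assumed to be the usual sample variance of the estimates with divisor m - 1. The jackknife estimates used for the other quantities are not shown.

```python
import math
from statistics import variance

def mc_error_coef(estimates):
    """Monte Carlo standard error of a combined coefficient: sqrt(B) / sqrt(m)."""
    m = len(estimates)
    b = variance(estimates)   # between-imputation variance B (divisor m - 1)
    return math.sqrt(b) / math.sqrt(m)
```

For a fixed B, the result shrinks in proportion to 1/sqrt(m), which is why MC error generally decreases as the number of imputations is increased.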

There are no mim_options for mim: check and mim: genmiss. mim: predict allows options appropriate to predict after command - see Notes on mim: predict for further information.

Remarks

Remarks are presented under the headings MIM dataset format, Display of regression results, Combining estimates using Rubin's rules, Notes on mim: predict, and Score labels in -mlogit-.

MIM dataset format

For a multiply-imputed dataset to be compatible with mim, the dataset must contain:

    a numeric variable called _mi_m whose values identify the individual dataset to which each observation belongs, and

    a numeric variable called _mi_id whose values identify the observations within each individual dataset.

Moreover, if the original data with missing values are to be stored in the dta file, then those observations must be identified with the value _mi_m==0, while imputed datasets are identified using positive _mi_m values. In particular, the dataset in the stack identified by _mi_m==0 is ignored for the purpose of model fitting with mim. For convenience, a multiply-imputed dataset satisfying the above requirements is called a MIM dataset.

The requirements above have been kept as simple as possible. They allow a set of multiply-imputed datasets stored in separate files to be stacked into the format required by mim using only the basic data processing commands generate, append and replace. (Nevertheless, for convenience, a dedicated command mimstack has been provided for this purpose.)

An example of a multiply imputed dataset in mim-compatible format is shown below. The original data consist of a completely observed variable y and a variable x with missing values in the 3rd, 4th and 6th observations, and there are 2 imputed copies of the original dataset in the stack.

    _mi_m   _mi_id      y          x
    ----------------------------------
        0        1    1.1        105
        0        2    9.2        106
        0        3    1.1          .
        0        4    2.3          .
        0        5    7.5        108
        0        6    7.9          .
        1        1    1.1        105
        1        2    9.2        106
        1        3    1.1    109.796
        1        4    2.3    110.456
        1        5    7.5        108
        1        6    7.9    102.243
        2        1    1.1        105
        2        2    9.2        106
        2        3    1.1    107.952
        2        4    2.3    115.968
        2        5    7.5        108
        2        6    7.9    114.479
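The structure of this example can be made concrete with a small sketch. The Python below stores the stacked data as rows and applies the two main checks described for mim: check under Utility commands (observed values must be unchanged in every imputed dataset, and every missing value must have been imputed). It is an illustration under these assumptions, not the implementation of mim: check; the name check_stack and the row layout are hypothetical.

```python
# Rows are (_mi_m, _mi_id, y, x); None marks a missing value.
stack = [
    (0, 1, 1.1, 105), (0, 2, 9.2, 106), (0, 3, 1.1, None),
    (0, 4, 2.3, None), (0, 5, 7.5, 108), (0, 6, 7.9, None),
    (1, 1, 1.1, 105), (1, 2, 9.2, 106), (1, 3, 1.1, 109.796),
    (1, 4, 2.3, 110.456), (1, 5, 7.5, 108), (1, 6, 7.9, 102.243),
    (2, 1, 1.1, 105), (2, 2, 9.2, 106), (2, 3, 1.1, 107.952),
    (2, 4, 2.3, 115.968), (2, 5, 7.5, 108), (2, 6, 7.9, 114.479),
]

def check_stack(rows):
    """Return a list of problems found in a stacked dataset (empty if none)."""
    # Original data (_mi_m == 0), indexed by _mi_id.
    original = {r[1]: r[2:] for r in rows if r[0] == 0}
    problems = []
    for m, obs_id, *values in rows:
        if m == 0:
            continue  # the original data are only the reference
        for k, (orig, imp) in enumerate(zip(original[obs_id], values)):
            if imp is None:
                problems.append((m, obs_id, k, "missing value not imputed"))
            elif orig is not None and orig != imp:
                problems.append((m, obs_id, k, "observed value changed"))
    return problems
```

For the example dataset, check_stack(stack) returns an empty list, since y is identical in every copy and every missing x has been filled in.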

Display of regression results

mim displays parameter estimates (obtained by Rubin's rules - see Model fitting) and their standard errors, taking into account between- and within-imputation variation. Confidence intervals and test statistics for regression coefficients are based on the t distribution with estimated degrees of freedom (d.f.) obtained using the method of Barnard and Rubin. The final entry for each parameter estimate in the model is "FMI", standing for "fraction of missing information". For each predictor, the FMI is a function of the ratio of the between- to within-imputation variance of the estimated coefficient and its d.f.:

FMI = [r + 2/(d.f. + 3)]/(r + 1)

where r is the "relative increase in variance due to non-response" (Rubin). Since d.f. is always positive, FMI lies between 0 and 1, and since d.f. is usually considerably larger than 3, FMI is approximately r/(r + 1). The larger the value of FMI, the greater the loss of information (hence loss of precision) that has been induced in the estimated coefficient by the missing data.
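A short numerical sketch of the FMI formula may help. The Python functions below simply evaluate the expression given above and its large-d.f. approximation; the function names are hypothetical, and r and d.f. are taken as given rather than estimated from data.

```python
def fmi(r, df):
    """Fraction of missing information: [r + 2/(d.f. + 3)] / (r + 1)."""
    return (r + 2.0 / (df + 3.0)) / (r + 1.0)

def fmi_approx(r):
    """Approximation r / (r + 1), adequate when d.f. is much larger than 3."""
    return r / (r + 1.0)
```

With r = 0.25 and d.f. = 37, fmi(0.25, 37) is 0.3/1.25 = 0.24, compared with 0.20 from the approximation; the gap narrows as d.f. grows.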

It is important to remember that the reported FMI is an estimate. For a small number of imputations, the estimate may be imprecise. Just how imprecise may be gauged to some extent by increasing the number of imputations, refitting the model in mim and inspecting the resulting FMI.

Combining estimates using Rubin's rules

While statistical theory guarantees the asymptotic normality of regression coefficients estimated by maximum likelihood, the same guarantee does not extend to other statistics in general. One should be aware that combining estimates across imputations using Rubin's rules may not always make sense. In particular, it assumes that the sampling distribution of the estimate is approximately normal, with the corresponding SE (if supplied). It may be appropriate to transform the scale of the parameter (e.g. Fisher's transform for the correlation coefficient) before obtaining MI combined estimates.

Notes on mim: predict

The syntax of mim: predict is

    mim: predict newvarname [, predict_options]

where predict_options are options appropriate to predict for command, the regression command just run by mim. Note that mim: predict can only predict one new variable (newvarname) at a time. Thus syntaxes of predict that allow one to predict several variables at once are disallowed. The most obvious example is mlogit. For example, suppose y was a 3-level categorical outcome variable, coded 1, 2, 3, and a model of the form mim: mlogit y explanatory_variables had just been fit. The command

    . mim: predict yhat1 yhat2 yhat3, xb

would result in an error message (too many variables specified), whereas following regular mlogit, it would be valid. The solution with mim: predict is

    . mim: predict yhat1, outcome(1) xb
    . mim: predict yhat2, outcome(2) xb
    . mim: predict yhat3, outcome(3) xb

The default action for mim: predict is the same as the default for predict after command.
For example, when command is logit, mim: predict produces the event probability, not the linear predictor. The option xb must be included to obtain the linear predictor. The values returned in the imputed datasets (_mi_m > 0) use imputation-specific parameter estimates and (if appropriate) the imputed covariate values. The values returned in the _mi_m = 0 section of the dataset are obtained by combining the predictions from the imputed datasets using Rubin's rules.

As just mentioned, the across-imputation average of whatever is being predicted is stored in imputation 0 (_mi_m = 0). Note, however, that if after fitting (say) a mim: logit model you do mim: predict p and mim: predict xb, xb, then logit(p) = xb for _mi_m > 0 but not for _mi_m = 0. The behaviour is logical, because the average of a non-linear function is not the function of the average, but should nevertheless be borne in mind.

There may be better ways to perform multiple-imputation inference for a desired predicted quantity, particularly when the latter is a highly non-linear function of the original model parameters. In the case of logistic regression, for example, a user might prefer to combine on the linear predictor scale before obtaining inferences for predicted probabilities by back-transformation, i.e. mim: predict xb, xb followed by gen p = invlogit(xb), which will not give the same results as mim: predict p. There appears to be no clear statistical theory to guide these decisions.

Score labels in -mlogit-

It is legal in Stata for score labels to contain periods (UK English: full stops). For example,

    . label define edulbl 1 "Less than H.S." 2 "H.S." 3 "Assoc. or higher"
    . label values edu edulbl

is perfectly valid. Such labels define equation names when used with the mlogit command. However, Stata does not allow them to be transferred "manually" to matrices, a feature which would stop mim in its tracks. To avoid the problem, mim converts the periods in such labels to underscores when reporting mlogit model equations.
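The effect of the label conversion on the example labels can be sketched in one line of Python. Only the replacement of periods by underscores is taken from the description above; the function name equation_name is hypothetical, and any other treatment that Stata or mim may apply to labels (for example, to spaces) is deliberately not modelled.

```python
def equation_name(label):
    """Replace each period in a label with an underscore."""
    return label.replace(".", "_")

labels = ["Less than H.S.", "H.S.", "Assoc. or higher"]
converted = [equation_name(lab) for lab in labels]
```

Under this sketch, the label "H.S." would be reported as "H_S_".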
Saved results

After model fitting, mim returns results in e() as follows.

Result                 Description
-------------------------------------------------------------------------
Matrices
  e(MIM_Q)             coefficient estimates
  e(MIM_T)             total covariance matrix estimate
  e(MIM_TLRR)          Li-Raghunathan-Rubin (1991) estimate of total
                       covariance matrix
  e(MIM_W)             within imputation covariance matrix estimate
  e(MIM_B)             between imputation covariance matrix estimate
  e(MIM_dfvec)         vector of MI degrees of freedom
  e(MIM_lambda)        vector of fraction of missing information (FMI)
  e(MIM_r)             vector of increase in variance due to missing
                       information

Scalars
  e(MIM_dfmin)         minimum of e(MIM_dfvec)
  e(MIM_dfmax)         maximum of e(MIM_dfvec)
  e(MIM_Nmin)          minimum number of observations used in estimation
  e(MIM_Nmax)          maximum number of observations used in estimation

Macros
  e(MIM_m)             number of imputed datasets used in estimation
  e(MIM_levels)        values of _mi_m variable used in estimation
  e(MIM_prefix)        value of e(prefix) returned by command
  e(MIM_prefix2)       mim
  e(MIM_cmd)           the name of the estimation command specified in
                       command
  e(MIM_depvar)        value of e(depvar) returned by command
  e(MIM_title)         value of e(title) returned by command
  e(MIM_properties)    value of e(properties) returned by command
  e(MIM_eform)         value of e(eform) returned by command

Additional results (returned when storebv option is specified)
  e(b)                 equal to e(MIM_Q)
  e(V)                 equal to e(MIM_T)
  e(N)                 equal to e(MIM_Nmin)
  e(sample)            equal to 1 for observations in the estimation
                       sample, 0 otherwise
  e(cmd)               equal to e(MIM_cmd)
  e(depvar)            equal to e(MIM_depvar)
  e(df_r)              equal to e(MIM_dfmin)
  e(properties)        equal to e(MIM_properties)
-------------------------------------------------------------------------

Examples

Examples and accompanying remarks are given under the headings Model fitting, Data manipulation, Post-estimation, Replay of estimation results [advanced], Utility commands, and Combining estimates using Rubin's rules.

Model fitting

When invoked for model fitting, mim applies command to each of the imputed datasets in the current MIM dataset, and then combines the individual estimates using Rubin's rules for multiple-imputation-based inferences. In most cases fitting a statistical model to a multiply-imputed dataset with mim is simply a matter of loading the MIM-format dataset into Stata and executing the desired estimation command, prefixing it with the mim prefix. Several examples are provided below.

    . use mymimdataset1, clear
    . mim: regress y x1 x2 x3 x4

    . use mymimdataset2, clear
    . mim: logistic y x1 x2, coef

    . use mymimdataset3, clear
    . xi: mim: glm low age lwt i.race smoke ptl ht ui, f(bin) l(logit) le(90)
    . xi: mim: stepwise, pr(0.05): glm low age lwt (i.race) smoke ptl ht ui, f(bin) l(logit) le(90)

    . use mymimdataset4, clear
    . mim: svy: proportion heartatk
    . mim: svy: logistic heartatk age weight height
    . mim, noi: svy jackknife, nodots: logit highbp height weight age age2 female black, or

    . use mymimdataset5, clear
    . mim: xtmixed gsp private emp water other unemp || region: R.state, l(90)

Additionally, other Stata estimation commands may be fitted to a MIM dataset using the category(fit) option of mim. Two examples are given below.

    . use mymimdataset6, clear
    . mim, cat(fit): mvprobit (private = years logptax loginc) (vote=years logptax loginc), nolog

    . use mymimdataset7, clear
    . mim, cat(fit): MyNewCommand y x1 x2

Data manipulation

The stacked dataset format used by mim allows simple data manipulation such as generating and replacing variables to be performed using existing Stata commands. More complex data manipulation tasks, particularly those that alter the number of observations in each of the imputed datasets, usually require more detailed programming. For convenience, three common tasks, namely reshaping, appending and merging datasets, can be accomplished by prefixing the relevant command with mim. The first two are straightforward, and in most instances will be applied by simply prefixing the usual syntax with mim.

    . use mymimdataset7, clear
    . mim: reshape wide income, i(id) j(year)
    . mim: reshape long

    . use mymimdataset8, clear
    . mim: append using mymimdataset9

Merging two mim-compatible datasets requires a little further explanation, since it requires that the sortorder option be specified to mim. This option is necessary so that mim can generate a new _mi_id variable once merging is complete. For example, suppose that mymimdataset10 is a mim-compatible dataset containing patient details, with each patient having a unique id, and mymimdataset11 is a second stacked dataset containing additional longitudinal measurements on each patient, with each measurement uniquely identified by the two variables id and time. Merging these data into a single dataset would usually be accomplished by a match-merge on the id variable. However, once merging is complete, the observations in the merged dataset are uniquely identified by the pair of variables id and time. Using mim the merge would be accomplished as follows:

    . use mymimdataset10, clear
    . mim, sortorder(id time): merge id using mymimdataset11

Additionally, other Stata commands that manipulate either a single dataset or a master/using pair of datasets may be applied to a multiply-imputed dataset using the category option of mim. This is most likely to be of interest when command is a user-written program designed to accomplish a project-specific task.

    . use mymimdataset12, clear
    . mim, category(manip) so(id): mystatacmd x1 x2 x3

Post-estimation

In general Stata's standard post-estimation methods cannot be directly applied with multiply-imputed data. Methods relying on likelihood comparisons (lrtest) are not applicable because multiple imputation does not involve calculation of likelihood functions for the data. Furthermore, application of a post-estimation command directly to the multiple-imputation estimates will not in general produce valid simultaneous inferences for multiple parameters, since applying Rubin's rules to the vector of parameter estimates and their associated variance-covariance matrices does not work reliably (Li et al. 1991). Performing inferences for target parameters that are scalar (unidimensional) is however easily accomplished using Rubin's rules, and this has enabled us to create multiple-imputation versions of lincom and predict. In addition, we have implemented the method of Li et al. (1991) to create a mim-specific version of testparm, which allows the testing of null hypotheses relating to a vector of parameters. Examples of the use of mim: lincom, mim: testparm and mim: predict are given below. For other post-estimation tasks see the additional remarks under Replay of estimation results [advanced].

Warning: mim: lincom has an anomalous feature. Stata's lincom following logistic behaves atypically compared with other Stata regression commands such as stcox.
If you wish to get odds ratio estimates with mim: logistic followed by mim: lincom, you should specify the model as mim: logit ..., or and the lincom command as mim: lincom exp, or.

    . use mymimdataset2, clear
    . mim: logit y x1 x2
    . mim: lincom x1 + 2 * x2
    . mim: lincom x1 + x2, or
    . mim: testparm _all
    . mim: predict yhat, xb
    . mim: predict yhatse, stdp

Replay of estimation results [advanced]

Multiple-imputation estimates may be replayed by simply typing mim at the command line. If the estimates for a given imputed dataset have previously been called up using the j(#) option, the overall (Rubin's rules) estimates may be re-displayed by typing mim, storebv or mim, clearbv. A level(#) option and any eform options supported by command may be specified during replay.

    . use mymimdataset2, clear
    . mim: logit y x1 x2
    . mim, or l(90)

Multiple-imputation estimates may be copied into e(b), e(V) etc. by specifying the storebv option during replay. Note that use of multiple-imputation estimates in this way is at the user's discretion, and validity of the results is not guaranteed. In particular, forcing the multiple-imputation estimates into e(b) and e(V) allows application of a Stata post-estimation command directly to the multiple-imputation estimates. While this may be valid in specific cases, it is certainly not valid in general (see Post-estimation for additional comments).

    . mim, storebv

(Note that the storebv option may also be specified during model fitting.)

Alternatively, by specifying the j(#) option of mim, the estimates corresponding to the application of command to one of the individual imputed datasets are copied into their usual place in e() (that is, into e(b), e(V) etc.). command can also be replayed directly in this situation, for example

    . mim: logit y x1 x2
    . mim, j(1)
    . logit, or

displays the estimated odds ratios for imputation #1.

The facility to replay individual estimates has been incorporated with extensibility in mind, particularly with regard to post-estimation. The most likely application is to loop over the individual estimates, replaying and capturing necessary quantities from each set of results in turn, and then combining these in some way, where the standard approach for simple scalar estimation would be to use Rubin's rules.

    . use mymimdataset2, clear
    . mim: logit y x1 x2
    . local levels `"`e(MIM_levels)'"'
    . foreach j of local levels {
    .     quietly mim, j(`j')
    .     ... apply some post-estimation command or capture some stored results here ...
    . }
    . combine results from individual estimations using Rubin's rules ...

Finally, to avoid inadvertent application of a Stata post-estimation command to estimates copied into e(b), e(V) etc. using either the j(#) or storebv option, the clearbv option is provided to allow one to clear these estimates when finished (without losing the multiple-imputation estimates from memory). It is recommended always to make use of this facility.

    . mim, clearbv

Utility commands

The check command provides a detailed integrity check of a multiply imputed dataset in stacked format. The main checks are that non-missing values must be constant across imputed datasets and that all missing values must have been imputed. Note that the utility commands are only applicable when the original dataset with missing values has been included in the stacked dataset (see MIM dataset format).

    . use mymimdataset12, clear
    . mim: check

Alternatively, the check can be restricted to selected variables.

    . mim: check x1 x2 x3 x4 x5

The genmiss command generates a missing indicator variable for a specified variable.

    . mim: genmiss x1

In this case the generated indicator variable is called _mim_x1 (and in general the naming convention used is to prefix varname with _mim_).

Combining estimates using Rubin's rules

Some simple examples of mim, category(combine) may help to clarify how to use this powerful facility. One small point to note: the degrees of freedom used in calculating the t-statistic for confidence intervals are slightly larger according to mim, category(combine) than to mim when fitting regression models. The result is that mim, category(combine) gives slightly narrower confidence intervals.

1. The mean of x with its SE and 95% CI computed in different ways

    Using the default calculating tool (statsby):

    . mim, cat(combine) est(_b[x]) se(_se[x]) : mean x
    . mim, cat(combine) est(_b[_cons]) se(_se[_cons]) : regress x
    . mim, cat(combine) est(r(mean)) se(sqrt(r(Var)/r(N))) : ameans x

    Note the use of an expression for the SE of the mean, namely se(sqrt(r(Var)/r(N))). statsby allows this flexibility but byvar doesn't.

    Using the alternative calculating tool (byvar):

    . mim, cat(combine) byvar est(b(x)) se(se(x)) : mean x
    . mim, cat(combine) byvar est(b(_cons)) se(se(_cons)) : regress x

2. Area under a ROC curve

    The aim is to fit a logistic regression of y on x1 and x2, compute the AUROC (area under the ROC curve) for the resulting linear predictor in each imputation, combine the AUROC values across imputations and report the mean AUROC with its SE and 95% CI.

    . mim: logit y x1 x2
    . mim: predict xb
    . mim, cat(combine) est(r(area)) se(r(se)) : roctab y xb
    . mim, cat(combine) byvar est(r(area)) se(r(se)) : roctab y xb

    We have noticed that byvar is substantially faster than statsby in some examples; in the roctab example just given, it takes one third of the time taken by statsby. The reason appears to be that statsby executes stata_cmd first for the entire dataset, then for each imputation, whereas byvar only does it for each imputation.

3. Using a sequence of Stata commands

    Note the feature of byvar that stata_cmd can be a sequence of Stata commands, separated by @. The feature is not available with statsby.

    For example, the mean AUROC in the second example above could be obtained by the following single command:

    . mim, cat(combine) byvar est(r(area)) : logit y x1 x2 @ lroc, nograph

    Since lroc does not return the SE of the AUROC, the se() option of mim, category(combine) is omitted and only the mean AUROC is reported.

4. Combining estimates of a parameter from a multi-equation model

    This is purely a pedagogic example, since mim reports combined results for all parameters of a multi-equation model anyway:

    . mim, cat(combine) est([ln_p]_b[_cons]) se([ln_p]_se[_cons]) : streg x1 x2, distribution(weibull)

Authors

John C. Galati & John B. Carlin, Clinical Epidemiology & Biostatistics Unit, Murdoch Children's Research Institute & University of Melbourne
john.carlin@mcri.edu.au

Patrick Royston, MRC Clinical Trials Unit, London.
pr@ctu.mrc.ac.uk

References

Carlin JB, Galati JC and Royston P. 2008. A new framework for managing and analyzing multiply imputed data in Stata. Stata Journal 8(1): 49-67.

Carlin JB, Li N, Greenwood P and Coffey C. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3): 226-244.

Efron B, Gong G. 1983. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 37: 36-48.

Li KH, Raghunathan TE, Rubin DB. 1991. Large-sample significance levels from multiply-imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association 86: 1065-1073.

Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.

Royston P. 2005. Multiple imputation of missing values: update. Stata Journal 5(2): 188-201.

Royston P. 2005. Multiple imputation of missing values: update of ice. Stata Journal 5(4): 527-536.

Royston P. 2007. Multiple imputation of missing values: further update of ice, with an emphasis on interval censoring. Stata Journal 7(4): 445-464.

Royston P, Carlin JB and White IR. 2009. Multiple imputation of missing values: new features for mim. Stata Journal to appear.

Also see

Online: help for mim, mimstack, mi