{smcl}
{* 11dec2009}{...}
{cmd:help for mim}{right:P Royston, JC Galati, JB Carlin & IR White}
{hline}
{title:Title}
{pstd}
{hi:mim} {hline 2} A prefix command for analysing and manipulating multiply imputed datasets
{title:Syntax}
{phang2}{cmd:mim} [{cmd:,} {it: mim_options}] {cmd::} {it:command}{p_end}
{phang2}{cmd:mim} [{cmd:,} {it: replay_options}]{p_end}
{synoptset 21 tabbed}{...}
{synopthdr:mim_options}
{synoptline}
{syntab:General}
{p2coldent:* {opt cat:egory(cat_type)}}where {it:cat_type} is {opt fit}, {opt manip} or
{opt combine} - specify whether {it:command} is estimation, data manipulation
or one whose (scalar) results are to be combined using Rubin's rules{p_end}
{synopt:{opt noi:sily}}display output from execution of {it:command} within each of the imputed datasets{p_end}
{syntab:Estimation (valid only for estimation commands)}
{synopt:{opt dot:s}}display progress dots during model fitting{p_end}
{synopt:{opt from(#)}}fit model, starting from imputation {it:#}{p_end}
{synopt:{opt to(#)}}fit model, ending with imputation {it:#}{p_end}
{synopt:{opt st:orebv}}fills {cmd:e(b)}, {cmd:e(V)} etc. with multiple-imputation estimates{p_end}
{syntab:Manipulation (valid only for data manipulation commands)}
{p2coldent:+ {opt so:rtorder(varlist)}}one or more variables that uniquely identify the observations in
a given imputed dataset following each execution of {it:command}{p_end}
{syntab:Combination (valid for a wide range of Stata commands)}
{synopt:{opt est(est_spec)}}specifies the scalar (called {it:est})
to be combined across imputations{p_end}
{synopt:{opt se(se_spec)}}specifies the standard error of {it:est}
to be combined across imputations{p_end}
{synopt:{opt byv:ar}}uses {help byvar} (rather than the default, {help statsby})
to extract and store {it:est} and its SE in each imputation{p_end}
{synoptline}
{p 4 6 2}* only necessary for estimation and data manipulation commands not listed under {help mim##description:Description}{p_end}
{p 4 6 2}+ not valid for {help append} and {help reshape}; MANDATORY for all other data manipulation commands.{p_end}
{synopthdr:replay_options}
{synoptline}
{synopt:{opt cl:earbv}}clears {cmd:e(b)}, {cmd:e(V)} etc., but leaves other {cmd:mim} estimates intact{p_end}
{synopt:{opt j(#)}}fills {cmd:e(b)}, {cmd:e(V)} etc. with estimates corresponding to imputed dataset {it:#}{p_end}
{synopt:{opt mc:error}}displays a table of Monte Carlo standard errors for quantities in the table of regression coefficients{p_end}
{synopt:{opt st:orebv}}same as for estimation, unless {cmd:j} option is specified{p_end}
{synopt:{it:reporting_options}}level and eform options supported by {it:command}{p_end}
{synoptline}
{p 4 6 2}{cmd:xi} is allowed as a prefix to {cmd:mim}, but not as a prefix to {it:command}; see {help xi}.{p_end}
{p 4 6 2}{cmd:svy} is allowed as a prefix to {it:command}, see {help svy}.{p_end}
{p 4 6 2}{cmd:version} is allowed as a prefix to {it:command}, see {help version}.{p_end}
{p2colreset}{...}
{marker description}{...}
{title:Description}
{pstd}
{cmd:mim} is a prefix command for working with multiply-imputed (MIM) datasets, where
{it:command} can be any of a wide range of Stata commands. The function that {cmd:mim}
performs depends on the category of {it:command} passed to {cmd:mim}: estimation,
data manipulation, post-estimation or utility. A limited range of commands can be used
with {cmd:mim} without specifying the {cmd:category()} option. These are:
{pin}
{it:Estimation:}
{help regress},
{help mean},
{help proportion},
{help ratio},
{help logistic},
{help logit},
{help ologit},
{help mlogit},
{help probit},
{help oprobit},
{help poisson},
{help glm},
{help binreg},
{help nbreg},
{help gnbreg},
{help blogit},
{help clogit},
{help cnreg},
{help mvreg},
{help rreg},
{help qreg},
{help iqreg},
{help sqreg},
{help bsqreg},
{help stcox},
{help streg},
{help xtgee},
{help xtreg},
{help xtlogit},
{help xtnbreg},
{help xtpoisson},
{help xtmixed},
{help "svy:regress"},
{help "svy:mean"},
{help "svy:proportion"},
{help "svy:ratio"},
{help "svy:logistic"},
{help "svy:logit"},
{help "svy:ologit"},
{help "svy:mlogit"},
{help "svy:probit"},
{help "svy:oprobit"},
{help "svy:poisson"},
{help "stepwise"}
{pin}
{it:Post Estimation:}
{help lincom},
{help testparm},
{help predict}
{pin}
{it:Data Manipulation:}
{help reshape},
{help append},
{help merge}
{pin}
{it:Utility:}
{cmd:check},
{cmd:genmiss}
{pstd}
With one exception, {it:command} is specified with its full usual syntax. The
exception is {help merge}, where only one "using" file is allowed. Also,
{it:command} may be one of two internal utility commands, {cmd:check} and
{cmd:genmiss}, where the required syntaxes are
{pin}{cmd:mim} {cmd::} {cmd:check} [{it:varlist}]{p_end}
{pin}{cmd:mim} {cmd::} {cmd:genmiss} {it:varname}{p_end}
{pstd}
respectively (see {help mim##utility:Utility commands} for more details regarding these
two commands).
{pstd}
Note that the {it:command} {opt stepwise} expects the syntax of Stata's
{helpb stepwise} command, and is itself a 'prefix' command. It uses
P-values from Wald tests for deciding whether to include or exclude
variables in a model.
{pstd}
Further Stata estimation and data manipulation commands can be used with {cmd:mim} by specifying
the option {cmd:category(}{it:cat_type}{cmd:)}, where {it:cat_type} may be {cmd:fit}
for estimation commands, {cmd:manip} for data manipulation commands or {cmd:combine} for
combining scalar estimates and their SEs according to Rubin's rules. See
{help mim##combine_estimates:Combining estimates using Rubin's rules}
for more details of {cmd:mim, category(combine)}, and
{help mim##combine_estimates_r:Combining estimates using Rubin's rules}
for a warning about combining estimates in this way. Use of {cmd:mim} in these
ways is at the user's discretion, and the results are not guaranteed.
{pstd}
The dataset structure used by {cmd:mim} is a stacked format. In Stata 11 it may
be either the new {it:flong} style or that created by Royston's
{help ice} command (if installed). Details of the dataset format may be found under
{help mim##format:mim dataset format} below. Also, please study the following
remarks on how {cmd:mim} functions under different versions of Stata.
{title:Options}
{dlgtab:General}
{phang}
{cmd:category} specifies the type of command that is being passed to {cmd:mim}:
estimation (category {cmd:fit}), data manipulation (category {cmd:manip}), or a command
whose scalar results are to be combined using Rubin's rules (category {cmd:combine}).
{phang}
{cmd:noisily} specifies that the results of the application of {it:command} to each of the
individual imputed datasets should be displayed.
{dlgtab:Estimation}
{phang}
{cmd:dots} specifies that progress dots should be displayed.
{phang}
{opt from(#)} fits the specified model from imputation {it:#} (i.e. for
{cmd:_mi_m >= }{it:#}). {it:#} must be an integer between 1
and {it:m}, the maximum value of {cmd:_mi_m} in the dataset.
Default {it:#} is 1.
{phang}
{cmd:storebv} specifies that the standard list of returned results for estimation commands be
filled using the multiple-imputation results. In particular this forces the multiple-imputation
coefficient and covariance matrix estimates into {cmd:e(b)} and {cmd:e(V)}, respectively, enabling
application at the user's own discretion of Stata post-estimation commands that use these quantities
directly (see {help mim##replay:Replay of estimation results [advanced]} for further details).
{phang}
{opt to(#)} fits the specified model between imputation {cmd:from()} and imputation {it:#}.
{it:#} must be an integer between 2 and {it:m}, where
{it:m} is the maximum value of {cmd:_mi_m}
in the dataset. Note that if
{it:#} > {it:m} then {it:#} is assumed to equal {it:m} and no
error is raised. Default {it:#} is {it:m}.
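{pmore}
As a minimal sketch of how {opt from()} and {opt to()} combine (the variable names
{cmd:y}, {cmd:x1} and {cmd:x2} are hypothetical, and the dataset is assumed to hold at
least 5 imputations), the following restricts model fitting to imputations 2 to 5:
{p_end}
{phang2}{cmd:. mim, from(2) to(5): regress y x1 x2}{p_end}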
{dlgtab:Manipulation}
{phang}
{cmd:sortorder} specifies a list of one or more variables that uniquely identify the
observations in each of the datasets in a {cmd:mim}-compatible dataset; for data
manipulation, this option must specify a list of variables that together uniquely identify the
observations in each dataset AFTER {it:command} has been applied to the given dataset
(note that {it:varlist} cannot include {cmd:_mi_id}, since the {cmd:_mi_m} and {cmd:_mi_id}
variables are dropped from each dataset prior to the call to {it:command}).
{dlgtab:Combination}
{phang}
{opt byvar} specifies that {cmd:byvar} be used to execute the required {it:stata_cmd}
in each imputation and store the required statistic (and optionally, its SE) in new
variable(s), to be combined by {cmd:mim} according to Rubin's rules. The default is to use
{cmd:statsby}. Use of {opt byvar} affects the syntax of the options {opt est()}
and {opt se()}, see below.
{phang}
{opt est(est_spec)} specifies the scalar {it:est} to be combined across
imputations. {it:est_spec} depends on whether the {opt byvar} option is used
or not. By default, {cmd:statsby} is used to compute {it:est} from
{it:stata_cmd} according to {it:est_spec}.
{pmore}
The following table shows what {it:est_spec} looks like when the estimand,
{it:est}, is a regression coefficient, its SE, or a quantity (usually a scalar)
returned by {it:stata_cmd} in either an {cmd:e()} or an {cmd:r()} result:
{center:{hline 64}}
{center:Type of estimand ({it:est})        {cmd:statsby} (default)  {cmd:byvar}          }
{center:{hline 64}}
{center:Regression coefficient        [{it:eq}]{cmd:_b[}{it:varname}{cmd:]}    {opt b(varname)}     }
{center:SE of regression coefficient  [{it:eq}]{cmd:_se[}{it:varname}{cmd:]}   {opt se(varname)}    }
{center:Quantity returned in e()      {opt e(quantityname)}    {opt e(quantityname)}}
{center:Quantity returned in r()      {opt r(quantityname)}    {opt r(quantityname)}}
{center:{hline 64}}
{pmore}
The optional {it:eq} refers to an 'equation'; {it:eq} may be {cmd:#}{it:#},
where {it:#} is an equation number, or an equation name. {cmd:byvar} does not
currently support multiple equations.
{phang}
{opt se(se_spec)} specifies the standard error of {it:est} to be used
with Rubin's rules. Note that {opt se()} is optional; if omitted, only the
mean of {it:est} across imputations is calculated. {it:se_spec} follows
the same rules as {it:est_spec} (see {opt est()} above).
{dlgtab:Replay}
{phang}
{cmd:clearbv} specifies that the additional items returned using
the {cmd:storebv} or {cmd:j} options be cleared, but that all other
estimation results returned by {cmd:mim} be left intact.
{phang}
{opt j(#)} specifies that the standard results returned by estimation
commands be filled using the estimates from the last fit of an estimation
command applied to the {it:#}th imputed dataset, and that these estimates
be replayed.
{phang}
{opt mcerror} displays a table of Monte Carlo standard errors for
the quantities presented in the main table of multiple-imputation
results. The MC standard errors measure the uncertainty in the estimated
quantities due to the use of a finite number m of imputations.
In general, MC error decreases as m is increased.
The MC error for the regression coefficients is
computed as the square root of the between-imputation variance
(B) divided by the square root of the number of imputations.
For the other quantities, jackknife estimates (leaving out one
imputation each time) (Efron & Gong 1983) are presented.
The {opt mcerror} option may not be combined with replay
options other than {it:reporting_options},
nor may it be specified at model-fitting time.
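{pmore}
As an illustrative calculation with made-up numbers (not output from {cmd:mim}): if the
between-imputation variance of a coefficient is B = 0.04 and m = 5, then
{p_end}
{pmore2}MC error = sqrt(B)/sqrt(m) = 0.2/2.236 = 0.089 (approximately){p_end}
{pmore}
Under the same formula, keeping B fixed and increasing m to 20 would give
0.2/4.472 = 0.045 (approximately), i.e. quadrupling m halves the MC error.
{p_end}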
{phang}
{cmd:storebv}, same as for estimation, unless the {cmd:j} option is specified.
{phang}
{it:reporting_options} specifies {opt level()} and {opt eform} options
supported by {it:command}.
{pstd}
There are no {it:mim_options} for {cmd:mim: check} and {cmd:mim: genmiss}.
{cmd:mim: predict} allows options appropriate to {cmd:predict}
after {it:command} - see {help mim##mimpredict:Notes on mim: predict} for
further information.
{title:Remarks}
{pstd}
Remarks are presented under the headings
{it:mim and Stata 11},
{it:mim dataset format},
{it:Display of regression results},
{it:Combining estimates using Rubin's rules},
{it:Notes on mim: predict},
{it:Running mim in more than one instance of Stata},
and {it:Score labels in -mlogit-}.
{title:mim and Stata 11}
{pstd}
With Stata 11, {cmd:mim} recognizes the 'old' ice-style format
variables ({cmd:_mi} and {cmd:_mj}) and the new {cmd:mi}-style variables
({cmd:_mi_id} and {cmd:_mi_m}). Note that multiply imputed data created
by {help ice} can be imported into the {cmd:mi} {it:flong} style by
using the command {help mi import ice}{cmd:, clear automatic}. The
{opt automatic} option ensures that the imputed variables are
correctly registered. If you omit the option, you may encounter
difficulties.
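{pstd}
A minimal sketch of that import workflow is shown here. It assumes the {cmd:ice}-format data
are stored in a hypothetical file {cmd:myicedata.dta} containing hypothetical variables
{cmd:y}, {cmd:x1} and {cmd:x2}:
{p_end}
{phang}{cmd:. use myicedata, clear}{p_end}
{phang}{cmd:. mi import ice, clear automatic}{p_end}
{phang}{cmd:. mim: regress y x1 x2}{p_end}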
{pstd}
If {cmd:mim} is called by a Stata version below 11.0, it recognizes only
{cmd:_mi} and {cmd:_mj} as format variables. If called by Stata version 11.0
or higher, {cmd:mim} first looks for {cmd:_mi} and {cmd:_mj}. If it
fails to find them, it checks for an {cmd:mi}-style data structure
and if necessary converts the data to style {it:flong} (see {help mi set}
and {help mi convert}). Note that the {it:flong} style persists after {cmd:mim}
has finished. Finally, if neither type of formatting is found,
{cmd:mim} gives up and issues an error message.
{pstd}
In what follows, the format variables are called {cmd:_mi_id} and {cmd:_mi_m}
with the implicit understanding that if the data are in the {cmd:ice} format, we
mean {cmd:_mi} and {cmd:_mj}, respectively.
{pstd}
With Stata 11, if the data are in {cmd:mi} format and {cmd:mim} creates
new variables, e.g. with the {cmd:mim: predict} {it:newvar} command,
make sure you keep such variables unregistered. To avoid possible data
loss in Stata 11 when working with {cmd:mim}, do NOT convert the data
to a different {cmd:mi} style using {help mi convert}.
{pstd}
When {cmd:mim} starts, it checks and reports which format is being used.
{marker format}{...}
{title:mim dataset format}
{pstd}
For a multiply-imputed dataset to be compatible with {cmd:mim}, the dataset must contain:
{phang2}
a numeric variable called {cmd:_mi_m} whose values identify the individual dataset to
which each observation belongs,
{p_end}
{phang2}
a numeric variable called {cmd:_mi_id} whose values identify the observations within
each individual dataset.
{pstd}
Moreover, if the original data with missing values are to be stored in the dta file, then those
observations must be identified with the value {cmd:_mi_m==0}, while imputed datasets are identified
using positive {cmd:_mi_m} values. In particular, the dataset in the stack identified by
{cmd:_mi_m==0} is ignored for the purpose of model fitting with {cmd:mim}. For convenience, a
multiply-imputed dataset satisfying the above requirements is called a {cmd:MIM dataset}.
{pstd}
The requirements above have been kept as simple as possible. They allow a set of multiply-imputed datasets
stored in separate files to be stacked into the format required by {cmd:mim} using only the
basic data processing commands {cmd:generate}, {cmd:append} and {cmd:replace}. (Nevertheless,
for convenience, a dedicated command {help mimstack} has been provided for this purpose.)
{pstd}
An example of a multiply imputed dataset in {cmd:mim}-compatible format is shown below. The
original data consist of a completely observed variable y and a variable x with missing
values in the 3rd, 4th and 6th observations, and there are 2 imputed copies of the original
dataset in the stack.
{center:  {cmd:_mi_m}   {cmd:_mi_id}       {cmd:y}         {cmd:x}}
{center:{hline 34}}
{center:      0        1     1.1   105    }
{center:      0        2     9.2   106    }
{center:      0        3     1.1     .    }
{center:      0        4     2.3     .    }
{center:      0        5     7.5   108    }
{center:      0        6     7.9     .    }
{center:      1        1     1.1   105    }
{center:      1        2     9.2   106    }
{center:      1        3     1.1   109.796}
{center:      1        4     2.3   110.456}
{center:      1        5     7.5   108    }
{center:      1        6     7.9   102.243}
{center:      2        1     1.1   105    }
{center:      2        2     9.2   106    }
{center:      2        3     1.1   107.952}
{center:      2        4     2.3   115.968}
{center:      2        5     7.5   108    }
{center:      2        6     7.9   114.479}
{marker display}{...}
{title:Display of regression results}
{pstd}
{opt mim} displays parameter estimates (obtained by Rubin's rules -
see {help mim##fitting:Model fitting}) and their standard
errors, taking into account between- and within-imputation variation.
Confidence intervals and test statistics for regression coefficients
are based on the t distribution with estimated degrees of freedom
(d.f.) obtained using the method of Barnard and Rubin. The final entry
for each parameter estimate in the model is "FMI", standing for
"fraction of missing information". For each predictor, the FMI is a
function of the ratio of the between- to within-imputation variance of
the estimated coefficient and its d.f.:
{pmore}FMI = [r + 2/(d.f. + 3)]/(r + 1)
{pstd}
where r is the "relative increase in variance due to non-response"
(Rubin). Since d.f. is always positive, FMI lies between 0 and 1, and
since d.f. is usually considerably larger than 3, FMI is approximately
r/(r + 1).
The larger the value of FMI, the greater the loss of information
(hence loss of precision) that has been induced in the estimated
coefficient by the missing data.
{pstd}
It is important to remember that the reported FMI is an {it:estimate}.
For a small number of imputations, the estimate may be imprecise.
Just how imprecise may be gauged to some extent by increasing the
number of imputations, refitting the model in {opt mim} and inspecting
the resulting FMI.
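{pstd}
As a worked illustration of the FMI formula, using made-up values rather than
{cmd:mim} output, take r = 0.25 and d.f. = 37:
{p_end}
{pmore}FMI = [0.25 + 2/(37 + 3)]/(0.25 + 1) = 0.30/1.25 = 0.24{p_end}
{pstd}
The approximation r/(r + 1) gives 0.25/1.25 = 0.20 in this case, so the d.f. term
is not always negligible when d.f. is modest.
{p_end}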
{marker combine_estimates_r}{...}
{title:Combining estimates using Rubin's rules}
{pstd}
While statistical theory guarantees the asymptotic normality of
regression coefficients estimated by maximum likelihood, the same guarantee
does not extend to arbitrary statistics. One should be aware that combining estimates
across imputations using Rubin's rules may not always make sense.
In particular, the method assumes that the sampling distribution of the
estimate is approximately normal, with the corresponding SE (if supplied).
It may be appropriate to transform the scale of the parameter
(e.g. Fisher's transform for the correlation coefficient) before obtaining
MI combined estimates.
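{pstd}
To make the transformation idea concrete, here is a sketch for the correlation
coefficient. It uses the standard Fisher z-transform and its usual large-sample SE; it is
an illustration only and not a description of what {cmd:mim} computes internally. Writing
rho for the sample correlation and n for the number of observations,
{p_end}
{pmore}z = atanh(rho) = 0.5*ln[(1 + rho)/(1 - rho)], with SE(z) approximately 1/sqrt(n - 3){p_end}
{pstd}
One would compute z and SE(z) in each imputed dataset, combine them across imputations,
and then back-transform the combined estimate and its confidence limits using
rho = tanh(z).
{p_end}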
{marker mimpredict}{...}
{title:Notes on mim: predict}
{pstd}
The syntax of {cmd:mim: predict} is
{phang}{cmd:mim: predict} {it:newvarname} [{cmd:,} {it:predict_options}]{p_end}
{pstd}
where {it:predict_options} are options appropriate to {cmd:predict} for
{it:command}, the regression command just run by {cmd:mim}. Note that
{cmd:mim: predict} can only predict one new variable ({it:newvarname})
at a time. Thus syntaxes of {cmd:predict}
that allow one to predict several variables at once are disallowed. The most
obvious example is {cmd:mlogit}. For example, suppose {cmd:y} was a 3-level
categorical outcome variable, coded 1, 2, 3, and a model of the form
{cmd:mim: mlogit y} {it:explanatory_variables} had just been fit. The
command
{phang}{cmd:. mim: predict yhat1 yhat2 yhat3, xb}
{pstd}
would result in an error message ({cmd:too many variables specified}), whereas
following regular {cmd:mlogit}, it would be valid. The solution with
{cmd:mim: predict} is
{phang}{cmd:. mim: predict yhat1, outcome(1) xb}{p_end}
{phang}{cmd:. mim: predict yhat2, outcome(2) xb}{p_end}
{phang}{cmd:. mim: predict yhat3, outcome(3) xb}{p_end}
{pstd}
The default action for {cmd:mim: predict} is the same as the default for
{cmd:predict} after {it:command}. For example, when {it:command} is
{cmd:logit}, {cmd:mim: predict} produces the event probability, not the linear
predictor. The option {opt xb} must be included to obtain the linear predictor.
The values returned in the imputed datasets ({cmd:_mi_m} > 0) use
imputation-specific parameter estimates and (if appropriate) the imputed
covariate values. The values returned in the {cmd:_mi_m} = 0 section of the
dataset are obtained by combining the predictions from the imputed datasets
using Rubin's rules.
{pstd}
As just mentioned, the across-imputation average of whatever is being predicted
is stored in
imputation 0 ({cmd:_mi_m} = 0). Note, however, that if after fitting (say) a
{cmd:mim: logit} model you do {cmd:mim: predict p} and {cmd:mim: predict xb, xb},
then logit({cmd:p}) = {cmd:xb} for {cmd:_mi_m} > 0 but not for {cmd:_mi_m} = 0.
This is expected, because the logit of an average of probabilities is not in general
equal to the average of their logits, but it should nevertheless be borne in mind.
{pstd}
There may be better ways to perform multiple-imputation inference for a
desired predicted quantity, particularly when the latter is a highly
non-linear function of the original model parameters.
In the case of logistic regression, for example, a user might prefer to
combine on the linear predictor scale before obtaining inferences for
predicted probabilities by back-transformation, i.e.
{cmd:mim: predict xb, xb} followed by {cmd:gen p = invlogit(xb)}, which
will not give the same results as {cmd:mim: predict p}. There appears to
be no clear statistical theory to guide these decisions.
{title:Running mim in more than one instance of Stata}
{pstd}
In Stata 10 and higher, to maximize speed of operation,
{cmd:mim} stores the estimates from models
fit to the imputed datasets on disk, in the default system
temporary folder (whose precise name varies from computer to
computer, depending on operating system preferences). The
estimates are stored in files named {bf:mim_ests}{it:#}{bf:.ster},
where {it:#} runs from 1 to m. If you run {cmd:mim} simultaneously
in more than
one instance of Stata, these files may interfere with one another,
resulting in an obscure-looking fatal error
(typically reported as {bf:r(603)}). For example, running simulation
studies involving {cmd:mim} in several copies of Stata at once
could cause the problem.
{pstd}
To avoid such clashes, we recommend that you tell {cmd:mim}
to store its estimates in memory. Load
the data to be analysed and issue the command
{pmore}{cmd:. char _dta[mim_ests] "memory"}
{pstd}The setting is cancelled by entering
{pmore}{cmd:. char _dta[mim_ests]}
{pstd}
in the current Stata session.
{title:Score labels in -mlogit-}
{pstd}
It is legal in Stata for score labels to contain periods (UK English: full
stops). For example,
{phang}{cmd:. label define edulbl 1 "Less than H.S." 2 "H.S." 3 "Assoc. or higher"}{p_end}
{phang}{cmd:. label values edu edulbl}
{pstd}
is perfectly valid. Such labels define equation-names when used with the
{cmd:mlogit} command. However, Stata does not allow them to be transferred
"manually" to matrices, a feature which would stop {cmd:mim} in its tracks.
To avoid the problem, {cmd:mim} converts the periods
in such labels to underscores when reporting {cmd:mlogit} model equations.
{marker results}{...}
{title:Saved results}
{pstd}
After model fitting, {cmd:mim} returns results in {cmd:e()} as follows.
{synoptset 18 tabbed}{...}
{synopthdr:Result}
{synoptline}
{syntab:{it:Matrices}}
{synopt:{cmd:e(MIM_Q)}}coefficient estimates{p_end}
{synopt:{cmd:e(MIM_T)}}total covariance matrix estimate{p_end}
{synopt:{cmd:e(MIM_TLRR)}}Li-Raghunathan-Rubin (1991) estimate of total covariance matrix{p_end}
{synopt:{cmd:e(MIM_W)}}within-imputation covariance matrix estimate{p_end}
{synopt:{cmd:e(MIM_B)}}between-imputation covariance matrix estimate{p_end}
{synopt:{cmd:e(MIM_dfvec)}}vector of MI degrees of freedom{p_end}
{synopt:{cmd:e(MIM_lambda)}}vector of fraction of missing information (FMI){p_end}
{synopt:{cmd:e(MIM_r)}}vector of relative increase in variance due to missing information{p_end}
{syntab:{it:Scalars}}
{synopt:{cmd:e(MIM_dfmin)}}minimum of {cmd:e(}{cmd:MIM_dfvec}{cmd:)}{p_end}
{synopt:{cmd:e(MIM_dfmax)}}maximum of {cmd:e(}{cmd:MIM_dfvec}{cmd:)}{p_end}
{synopt:{cmd:e(MIM_Nmin)}}minimum number of observations used in estimation{p_end}
{synopt:{cmd:e(MIM_Nmax)}}maximum number of observations used in estimation{p_end}
{syntab:{it:Macros}}
{synopt:{cmd:e(MIM_m)}}number of imputed datasets used in estimation{p_end}
{synopt:{cmd:e(MIM_levels)}}values of {cmd:_mi_m} variable used in estimation{p_end}
{synopt:{cmd:e(MIM_prefix)}}value of {cmd:e(}{it:prefix}{cmd:)} returned by {it:command}{p_end}
{synopt:{cmd:e(MIM_prefix2)}}{cmd:mim}{p_end}
{synopt:{cmd:e(MIM_cmd)}}the name of the estimation command specified in {it:command}{p_end}
{synopt:{cmd:e(MIM_depvar)}}value of {cmd:e(depvar)} returned by {it:command}{p_end}
{synopt:{cmd:e(MIM_title)}}value of {cmd:e(title)} returned by {it:command}{p_end}
{synopt:{cmd:e(MIM_properties)}}value of {cmd:e(properties)} returned by {it:command}{p_end}
{synopt:{cmd:e(MIM_eform)}}value of {cmd:e(eform)} returned by {it:command}{p_end}
{syntab:{it:Additional results (returned when}{cmd: storebv}{it: option is specified)}}
{synopt:{cmd:e(b)}}equal to {cmd:e(MIM_Q)}{p_end}
{synopt:{cmd:e(V)}}equal to {cmd:e(MIM_T)}{p_end}
{synopt:{cmd:e(N)}}equal to {cmd:e(MIM_Nmin)}{p_end}
{synopt:{cmd:e(sample)}}equal to 1 for observations in the estimation sample, 0 otherwise{p_end}
{synopt:{cmd:e(cmd)}}equal to {cmd:e(MIM_cmd)}{p_end}
{synopt:{cmd:e(depvar)}}equal to {cmd:e(MIM_depvar)}{p_end}
{synopt:{cmd:e(df_r)}}equal to {cmd:e(MIM_dfmin)}{p_end}
{synopt:{cmd:e(properties)}}equal to {cmd:e(MIM_properties)}{p_end}
{synoptline}
{p2colreset}{...}
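{pstd}
As a minimal sketch of how these results might be inspected after model fitting (the
variable names {cmd:y}, {cmd:x1} and {cmd:x2} are hypothetical), matrices can be listed
with {cmd:matrix list}, scalars shown with {cmd:display}, and macros dereferenced with
macro quotes:
{p_end}
{phang}{cmd:. mim: regress y x1 x2}{p_end}
{phang}{cmd:. matrix list e(MIM_Q)}{p_end}
{phang}{cmd:. matrix list e(MIM_T)}{p_end}
{phang}{cmd:. display e(MIM_dfmin)}{p_end}
{phang}{cmd:. display "`e(MIM_m)'"}{p_end}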
{title:Examples}
{pstd}
Examples and accompanying remarks are given under the headings
{it:Model fitting}, {it:Data manipulation}, {it:Post-estimation},
{it:Replay of estimation results [advanced]}, {it:Utility commands},
and {it:Combining estimates using Rubin's rules}.
{marker fitting}{...}
{title:Model fitting}
{pstd}
When invoked for model fitting, {cmd:mim} applies {it:command} to each of the
imputed datasets in the current MIM dataset, and then combines the individual
estimates using Rubin's rules for multiple-imputation-based inferences. In
most cases fitting a statistical model to a multiply-imputed dataset with
{cmd:mim} is simply a matter of loading the MIM-format dataset into Stata and
executing the desired estimation command, prefixing it with the {cmd:mim}
prefix. Several examples are provided below.
{phang}
{cmd:. use mymimdataset1, clear}
{p_end}
{phang}
{cmd:. mim: regress y x1 x2 x3 x4}
{p_end}
{phang}
{cmd:. use mymimdataset2, clear}
{p_end}
{phang}
{cmd:. mim: logistic y x1 x2, coef}
{p_end}
{phang}
{cmd:. use mymimdataset3, clear}
{p_end}
{phang}
{cmd:. xi: mim: glm low age lwt i.race smoke ptl ht ui, f(bin) l(logit) le(90)}
{p_end}
{phang}
{cmd:. xi: mim: stepwise, pr(0.05): glm low age lwt (i.race) smoke ptl ht ui, f(bin) l(logit) le(90)}
{p_end}
{phang}
{cmd:. use mymimdataset4, clear}
{p_end}
{phang}
{cmd:. mim: svy: proportion heartatk}
{p_end}
{phang}
{cmd:. mim: svy: logistic heartatk age weight height}
{p_end}
{phang}
{cmd:. mim, noi: svy jackknife, nodots: logit highbp height weight age age2 female black, or}
{p_end}
{phang}
{cmd:. use mymimdataset5, clear}
{p_end}
{phang}
{cmd:. mim: xtmixed gsp private emp water other unemp || region: R.state, l(90)}
{p_end}
{pstd}
Additionally, other Stata estimation commands may be fitted to a MIM dataset using the
{cmd:category(fit)} option of {cmd:mim}. Two examples are given below.
{phang}
{cmd:. use mymimdataset6, clear}
{p_end}
{phang}
{cmd:. mim, cat(fit): mvprobit (private = years logptax loginc) (vote=years logptax loginc), nolog}
{p_end}
{phang}
{cmd:. use mymimdataset7, clear}
{p_end}
{phang}
{cmd:. mim, cat(fit): MyNewCommand y x1 x2}
{p_end}
{title:Data manipulation}
{pstd}
The stacked dataset format used by {cmd:mim} allows simple data manipulation
such as generating and replacing variables to be performed using existing
Stata commands. More complex data manipulation tasks, particularly those that
alter the number of observations in each of the imputed datasets, usually
require more detailed programming. For convenience, three common tasks,
namely reshaping, appending and merging datasets, can be accomplished by
prefixing the relevant command with {cmd:mim}. The first two are
straightforward, and in most instances will be applied by simply prefixing
the usual syntax with {cmd:mim}.
{phang}
{cmd:. use mymimdataset7, clear}
{p_end}
{phang}
{cmd:. mim: reshape wide income, i(id) j(year)}
{p_end}
{phang}
{cmd:. mim: reshape long}
{p_end}
{phang}
{cmd:. use mymimdataset8, clear}
{p_end}
{phang}
{cmd:. mim: append using mymimdataset9}
{p_end}
{pstd}
Merging two {cmd:mim}-compatible datasets requires a little further
explanation, since it requires that the {cmd:sortorder} option be specified to
{cmd:mim}. This option is necessary so that {cmd:mim} can generate a new
{cmd:_mi_id} variable once merging is complete. For example, suppose that
{cmd:mymimdataset10} is a {cmd:mim}-compatible dataset containing patient
details, with each patient having a unique {cmd:id}, and {cmd:mymimdataset11}
is a second stacked dataset containing additional longitudinal measurements on
each patient, with each measurement uniquely identified by the two variables
{cmd:id time}. Merging these data into a single dataset would usually be
accomplished by a match-merge on the {cmd:id} variable. However, once merging
is complete, the observations in the merged dataset are determined by the pair
of variables {cmd:id} and {cmd:time}. Using {cmd:mim} the merge would be
accomplished as follows:
{phang}
{cmd:. use mymimdataset10, clear}
{p_end}
{phang}
{cmd:. mim, sortorder(id time): merge id using mymimdataset11}
{p_end}
{pstd}
Additionally, other Stata commands that manipulate either a single dataset or a
master/using pair of datasets may be applied to a multiply-imputed dataset
using the {cmd:category} option of {cmd:mim}. This is most likely to be of
interest when {it:command} is a user-written program designed to accomplish a
project-specific task.
{phang}
{cmd:. use mymimdataset12, clear}
{p_end}
{phang}
{cmd:. mim, category(manip) so(id): mystatacmd x1 x2 x3}
{p_end}
{marker postestimation}{...}
{title:Post-estimation}
{pstd}
In general Stata's standard post-estimation methods cannot be directly applied
with multiply-imputed data. Methods relying on likelihood comparisons
({cmd:lrtest}) are not applicable because multiple imputation does not
involve calculation of likelihood functions for the data. Furthermore,
application of a post-estimation command directly to the multiple-imputation
estimates will not in general produce valid simultaneous inferences for multiple
parameters, since applying Rubin's rules to the vector of parameter estimates
and their associated variance-covariance matrices does not work reliably
(Li et al, 1991). Performing inferences for target parameters that are scalar
(unidimensional) is however easily accomplished using Rubin's rules, and this
has enabled us to create multiple-imputation versions of {cmd:lincom} and
{cmd:predict}. In addition, we have implemented the method of Li et al (1991)
to create a {cmd:mim}-specific version of {marker testparm}{cmd:testparm},
which allows the testing of null hypotheses relating to a vector of parameters.
Examples of the use of {marker lincom}{cmd:mim: lincom}, {cmd:mim: testparm}
and {cmd:mim: predict} are given below. For other post-estimation tasks see the
additional remarks under
{help mim##replay:Replay of estimation results [advanced]}.
{pstd}
Warning: {cmd:mim: lincom} has an anomalous feature.
Stata's {cmd:lincom} following {cmd:logistic} behaves atypically
compared with other Stata regression commands such as {cmd:stcox}. If you
wish to get odds ratio estimates with {cmd:mim: logistic} followed by
{cmd:mim: lincom}, you should specify the model as {cmd:mim: logit ..., or}
and the lincom command as {cmd:mim: lincom} {it:exp}{cmd:, or}.
{phang}
{cmd:. use mymimdataset2, clear}
{p_end}
{phang}
{cmd:. mim: logit y x1 x2}
{p_end}
{phang}
{cmd:. mim: lincom x1 + 2 * x2}
{p_end}
{phang}
{cmd:. mim: lincom x1 + x2, or}
{p_end}
{phang}
{cmd:. mim: testparm _all}
{p_end}
{phang}
{cmd:. mim: predict yhat, xb}
{p_end}
{phang}
{cmd:. mim: predict yhatse, stdp}
{p_end}
{marker replay}{...}
{title:Replay of estimation results [advanced]}
{pstd}
Multiple-imputation estimates may be replayed by simply typing {cmd:mim} at the
command line. If the estimates for a given imputed dataset have previously
been called up using the {opt j(#)} option, the overall (Rubin's rules)
estimates may be re-displayed by typing {cmd:mim, storebv} or
{cmd:mim, clearbv}. A {opt level(#)} option and any {opt eform} options
supported by {it:command} may be specified during replay.
{phang}
{cmd:. use mymimdataset2, clear}
{p_end}
{phang}
{cmd:. mim: logit y x1 x2}
{p_end}
{phang}
{cmd:. mim, or l(90)}
{p_end}
{pstd}
Multiple-imputation estimates may be copied into {cmd:e(b)}, {cmd:e(V)} etc.
by specifying the {cmd:storebv} option during replay. Note that use of
multiple-imputation estimates in this way is at the user's discretion, and
validity of the results is not guaranteed. In particular, forcing the
multiple-imputation estimates into {cmd:e(b)} and {cmd:e(V)} allows
application of a Stata post-estimation command directly to the
multiple-imputation estimates. While this may be valid in specific cases,
it is certainly not valid in general (see
{help mim##postestimation:Post-estimation} for additional comments).
{phang}
{cmd:. mim, storebv}
{p_end}
{pstd}
(Note that the {cmd:storebv} option may also be specified during model fitting.)
{pstd}
Alternatively, by specifying the {opt j(#)} option of {cmd:mim}, the estimates
corresponding to the application of {it:command} to one of the individual
imputed datasets are copied into their usual place in {cmd:e()} (that is, into
{cmd:e(b)}, {cmd:e(V)} etc.). {it:command} can also be replayed directly in
this situation, for example
{phang}
{cmd:. mim: logit y x1 x2}
{p_end}
{phang}
{cmd:. mim, j(1)}
{p_end}
{phang}
{cmd:. logit, or}
{p_end}
{pstd}
displays the estimated odds ratios for imputation #1.
{pstd}
The facility to replay individual estimates has been incorporated with
extensibility in mind, particularly with regard to post-estimation. The most
likely application is to loop over the individual estimates, replaying and
capturing necessary quantities from each set of results in turn, and then
combining these in some way, where the standard approach for simple scalar
estimation would be to use Rubin's rules.
{phang}
{cmd:. use mymimdataset2, clear}
{p_end}
{phang}
{cmd:. mim: logit y x1 x2}
{p_end}
{phang}
{cmd:. local levels `"`e(MIM_levels)'"'}
{p_end}
{phang}
{cmd:. foreach j of local levels {c -(}}
{p_end}
{phang}
{cmd:. {space 3}quietly mim, j(`j')}
{p_end}
{phang}
{cmd:. {space 3}{it:... apply some post-estimation command or capture some stored results here ...}}
{p_end}
{phang}
{cmd:. {c )-}}
{p_end}
{phang}
{cmd:. {it:combine results from individual estimations using Rubin's rules ...}}
{p_end}
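{pstd}
As a minimal sketch of the combining step (an illustration only, not part of
{cmd:mim} itself), the loop below collects the coefficient of {cmd:x1} and its
standard error from each imputation and pools them by Rubin's rules: the pooled
estimate is the mean of the per-imputation estimates, and its variance is the
mean within-imputation variance plus (1 + 1/m) times the between-imputation
variance. The local macro names ({cmd:m}, {cmd:sumq}, {cmd:sumq2}, {cmd:sumw},
{cmd:qbar}, {cmd:W}, {cmd:B}, {cmd:T}) are hypothetical. The sketch assumes that
{cmd:e(MIM_levels)} lists only the imputed datasets, that there are at least
two of them, and that {cmd:_b[x1]} and {cmd:_se[x1]} are available after
{cmd:mim, j(#)}.
{phang}
{cmd:. mim: logit y x1 x2}
{p_end}
{phang}
{cmd:. local levels `"`e(MIM_levels)'"'}
{p_end}
{phang}
{cmd:. local m : word count `levels'}
{p_end}
{phang}
{cmd:. local sumq 0}
{p_end}
{phang}
{cmd:. local sumq2 0}
{p_end}
{phang}
{cmd:. local sumw 0}
{p_end}
{phang}
{cmd:. foreach j of local levels {c -(}}
{p_end}
{phang}
{cmd:. {space 3}quietly mim, j(`j')}
{p_end}
{phang}
{cmd:. {space 3}local sumq = `sumq' + _b[x1]}
{p_end}
{phang}
{cmd:. {space 3}local sumq2 = `sumq2' + _b[x1]^2}
{p_end}
{phang}
{cmd:. {space 3}local sumw = `sumw' + _se[x1]^2}
{p_end}
{phang}
{cmd:. {c )-}}
{p_end}
{phang}
{cmd:. quietly mim, clearbv}
{p_end}
{phang}
{cmd:. local qbar = (`sumq')/`m'}
{p_end}
{phang}
{cmd:. local W = (`sumw')/`m'}
{p_end}
{phang}
{cmd:. local B = ((`sumq2') - `m'*(`qbar')^2)/(`m' - 1)}
{p_end}
{phang}
{cmd:. local T = `W' + (1 + 1/`m')*(`B')}
{p_end}
{phang}
{cmd:. display "estimate = " `qbar' "   SE = " sqrt(`T')}
{p_end}
{pstd}
The parentheses around the macro references guard against Stata evaluating the
power operator before a leading minus sign when a stored value is negative.
Under the stated assumptions, the displayed values can be checked against the
estimate and standard error that {cmd:mim} itself reports for {cmd:x1}.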
{pstd}
Finally, to avoid inadvertent application of a Stata post-estimation command
to estimates copied into {cmd:e(b)}, {cmd:e(V)} etc. using either the
{opt j(#)} or {cmd:storebv} option, the {cmd:clearbv} option is provided to
allow one to clear these estimates when finished (without losing the
multiple-imputation estimates from memory). We recommend always making use of
this facility.
{phang}
{cmd:. mim, clearbv}
{p_end}
{marker utility}{...}
{title:Utility commands}
{pstd}
The {cmd:check} command provides a detailed integrity check of a multiply
imputed dataset in stacked format. The main checks are that non-missing
values must be constant across imputed datasets and that all missing values
must have been imputed. Note that the utility commands are only applicable
when the original dataset with missing values has been included in the stacked
dataset (see {help mim##format:MIM dataset format}).
{phang}
{cmd:. use mymimdataset12, clear}
{p_end}
{phang}
{cmd:. mim: check}
{p_end}
{pstd}
Alternatively, the check can be restricted to selected variables.
{phang}
{cmd:. mim: check x1 x2 x3 x4 x5}
{p_end}
{pstd}
The {cmd:genmiss} command generates a missing indicator variable for a specified variable.
{phang}
{cmd:. mim: genmiss x1}
{p_end}
{pstd}
In this case the generated indicator variable is called {cmd:_mim_x1} (and in
general the naming convention used is to prefix {it:varname} with {it:_mim_}).
{marker combine_estimates}{...}
{title:Combining estimates using Rubin's rules}
{pstd}
Some simple examples of {cmd:mim, category(combine)} may help to clarify how to use
this powerful facility. One small point to note: the degrees of freedom of the
t-distribution used for confidence intervals are slightly larger under
{cmd:mim, category(combine)} than under {cmd:mim} when fitting regression models.
As a result,
{cmd:mim, category(combine)} gives slightly narrower confidence intervals.
{pstd}{ul:1. The mean of {cmd:x} with its SE and 95% CI computed in different ways}
{pmore}Using the default calculating tool ({cmd:statsby}):
{pmore}{cmd:. mim, cat(combine) est(_b[x]) se(_se[x]) : mean x}{p_end}
{pmore}{cmd:. mim, cat(combine) est(_b[_cons]) se(_se[_cons]) : regress x}{p_end}
{pmore}{cmd:. mim, cat(combine) est(r(mean)) se(sqrt(r(Var)/r(N))) : ameans x}{p_end}
{pmore}Note the use of an expression for the SE of the mean, namely
{hi:se(sqrt(r(Var)/r(N)))}. {cmd:statsby} allows this flexibility but
{cmd:byvar} does not.
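{pmore}As a further sketch along the same lines (not one of the original
examples), the same mean could be obtained from {cmd:summarize}, which returns
{cmd:r(mean)}, {cmd:r(sd)} and {cmd:r(N)}, with the SE again built as an
expression. This assumes that {opt est()} and {opt se()} accept any expression
of returned results that {cmd:statsby} can evaluate:
{pmore}{cmd:. mim, cat(combine) est(r(mean)) se(r(sd)/sqrt(r(N))) : summarize x}{p_end}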
{pmore}Using the alternative calculating tool ({cmd:byvar}):
{pmore}{cmd:. mim, cat(combine) byvar est(b(x)) se(se(x)) : mean x}{p_end}
{pmore}{cmd:. mim, cat(combine) byvar est(b(_cons)) se(se(_cons)) : regress x}{p_end}
{pstd}{ul:2. Area under a ROC curve}
{pmore}
The aim is to fit a logistic regression of {cmd:y} on {cmd:x1} and {cmd:x2},
compute the AUROC (area under the ROC curve) of the resulting linear predictor
in each imputation, combine the AUROC values across imputations, and report
the mean AUROC with its SE and 95% CI.
{pmore}{cmd:. mim: logit y x1 x2}{p_end}
{pmore}{cmd:. mim: predict xb}{p_end}
{pmore}{cmd:. mim, cat(combine) est(r(area)) se(r(se)) : roctab y xb}{p_end}
{pmore}{cmd:. mim, cat(combine) byvar est(r(area)) se(r(se)) : roctab y xb}{p_end}
{pmore}
We have noticed that {cmd:byvar} is substantially faster than {cmd:statsby} in some
examples; in the {cmd:roctab} example just given, it takes one third of the time
taken by {cmd:statsby}. The reason appears to be that {cmd:statsby} executes
{it:stata_cmd} first for the entire dataset, then for each imputation, whereas
{cmd:byvar} only does it for each imputation.
{pstd}{ul:3. Using a sequence of Stata commands}
{pmore}
Note the feature of {cmd:byvar} that {it:stata_cmd} can
be a sequence of Stata commands, separated by {cmd:@}. The feature
is not available with {cmd:statsby}.
{pmore}
For example,
the mean AUROC in the second example above could be obtained
by the following single command:
{pmore}{cmd:. mim, cat(combine) byvar est(r(area)) : logit y x1 x2 @ lroc, nograph}{p_end}
{pmore}
Since {cmd:lroc} does not return the SE of the AUROC, the {opt se()}
option of {cmd:mim, category(combine)} is omitted and only the mean AUROC is reported.
{pstd}{ul:4. Combining estimates of a parameter from a multi-equation model}
{pmore}This is purely a pedagogic example, since {cmd:mim} reports combined results
for all parameters of a multi-equation model anyway:
{phang2}{cmd:. mim, cat(combine) est([ln_p]_b[_cons]) se([ln_p]_se[_cons]) : streg x1 x2, distribution(weibull)}{p_end}
{title:Authors}
{pstd}
John C. Galati & John B. Carlin, Clinical Epidemiology & Biostatistics Unit,
Murdoch Children's Research Institute & University of Melbourne{break}
john.carlin@mcri.edu.au
{pstd}
Patrick Royston, MRC Clinical Trials Unit, London.{break}
pr@ctu.mrc.ac.uk
{title:References}
{phang}
Carlin JB, Galati JC and Royston P. 2008.
A new framework for managing and analyzing multiply imputed data in Stata.
{it:Stata Journal} 8(1): 49-67.
{phang}
Carlin JB, Li N, Greenwood P and Coffey C. 2003.
Tools for analyzing multiple imputed
datasets. {it:Stata Journal} 3(3): 226-244.
{phang}
Efron B and Gong G. 1983. A leisurely look at the bootstrap, the jackknife,
and cross-validation. {it:The American Statistician} 37: 36-48.
{phang}
Li KH, Raghunathan TE and Rubin DB. 1991. Large-sample significance levels from
multiply-imputed data using moment-based statistics and an F reference distribution.
{it:Journal of the American Statistical Association} 86: 1065-1073.
{phang}
Royston P. 2004. Multiple imputation of missing values.
{it:Stata Journal} 4(3): 227-241.
{phang}
Royston P. 2005. Multiple imputation of missing values: update.
{it:Stata Journal} 5(2): 188-201.
{phang}
Royston P. 2005. Multiple imputation of missing values:
update of ice. {it:Stata Journal} 5(4): 527-536.
{phang}
Royston P. 2007. Multiple imputation of missing values: further
update of ice, with an emphasis on interval
censoring. {it:Stata Journal} 7(4): 445-464.
{phang}
Royston P, Carlin JB and White IR. 2009. Multiple imputation of missing values:
new features for mim. {it:Stata Journal} to appear.
{title:Also see}
{pstd}
Online: help for {help mim}, {help mimstack},
{help mi estimate} (if Stata 11 installed)
{p_end}