{smcl}
{* 30sep2004}{...}
{hline}
help for {hi:micombine7}{right:Patrick Royston}
{hline}
{title:Estimation of regression models with multiply imputed samples}
{p 8 17 2}
{cmd:micombine7}
{it:regression_cmd}
[{it:yvar}]
{it:covarlist}
[{cmd:if} {it:exp}]
[{cmd:in} {it:range}]
[{it:weight}]
[
{cmd:,}
{cmdab:nocons:tant}
{cmdab:det:ail}
{cmdab:e:form(}{it:string}{cmd:)}
{cmdab:g:enxb(}{it:newvarname}{cmd:)}
{cmdab:imp:id(}{it:varname}{cmd:)}
{cmd:lrr}
{it:regression_cmd_options}
]
{p 4 4 2}
where
{p 8 8 2}
{it:regression_cmd} may be
{help clogit},
{help cnreg},
{help glm},
{help logistic},
{help logit},
{help mlogit},
{help ologit},
{help oprobit},
{help poisson},
{help probit},
{help qreg},
{help regress},
{help rreg},
{help stcox},
{help streg},
or
{help xtgee}.
{p 4 4 2}
{cmd:micombine7} shares only a small subset of the features of all estimation commands
(see help {help estimates}); see {it:Remarks}.
{p 4 4 2}
All weight types supported by {it:regression_cmd} are allowed; see help
{help weights}.
{title:Description}
{p 4 4 2}
{cmd:micombine7} estimates the parameters of a regression model whose
type is determined by {it:regression_cmd}. Parameter estimates are combined
across several replicates obtained previously by multiple imputation,
e.g. by using {help mvis7} to create a file of imputed data.
See {it:Remarks} for a brief account of how {cmd:micombine7} combines
the estimates and obtains standard errors.
{title:Options}
{p 4 8 2}
{cmd:detail} gives details of the regression model for each imputation.
{p 4 8 2}
{cmd:eform(}{it:string}{cmd:)} indicates that the exponentiated
form of the coefficients is to be output and reporting of the constant is to
be suppressed; {it:string} is used to label the exponentiated coefficients.
{p 4 8 2}
{cmd:genxb(}{it:newvarname}{cmd:)} creates {it:newvarname} to hold the
linear predictor from each regression model, averaged over all the
imputations.
{p 4 8 2}
{cmd:impid(}{it:varname}{cmd:)} specifies that {it:varname} is the variable
identifying the imputations. The number of imputations is determined as
the number of unique values of {it:varname}. The default {it:varname} is {cmd:_j}.
{p 4 8 2}
{cmd:lrr} specifies that the Li-Raghunathan-Rubin (LRR) robust estimate of the
variance-covariance matrix of the regression coefficients be used
(Li, Raghunathan, and Rubin 1991).
{p 4 8 2}
{cmd:noconstant} suppresses the regression constant in all regressions.
{p 4 8 2}
{it:regression_cmd_options} may be any of the options appropriate to
{it:regression_cmd}.
{title:Remarks}
{p 4 4 2}
Details of statistical inference from multiple imputed datasets are nicely described
in a recent {it:Stata Journal} article by John Carlin and colleagues (Carlin et al., 2003).
Here, with due acknowledgment to John, I give an edited version of Section 2 of his article.
{p 4 4 2}
A simple method of combining estimates from several models was derived by Rubin (1987).
Suppose initially that primary interest lies in estimating a scalar quantity, Q.
Here, Q is a regression coefficient, for example, the log hazard ratio in a
proportional hazards model. Suppose that we have imputed m complete datasets
using an appropriate model. In each dataset, standard complete-data methods
are used to obtain an estimate of Q with an associated standard error.
Let Q(k) and U(k) denote the point estimate and variance respectively from the kth
(k = 1, 2, ... , m) dataset. The point estimate Q^ of Q from multiple imputation
is simply the arithmetic mean of Q(1),...,Q(m).
{p 4 4 2}
Obtaining a valid standard error for Q^ requires combining information
on within-imputation and between-imputation variation. The latter is important in
reflecting uncertainty due to variability between imputation samples. First,
a within-imputation variance component, W, is obtained as the mean of the
complete-data variance estimates, U(1),...,U(m). Second, a between-imputation variance
component, B, is calculated as the sum of squares of Q(1),...,Q(m) about Q^,
divided by m-1. The (total) variance T of Q^ is given by
{p 8 12 2}
T = W + B * (1 + 1/m)
{p 4 4 2}
Rubin (1987) showed that (Q - Q^)/sqrt(T) is distributed approximately
as Student's t on nu degrees of freedom, where
{p 8 12 2}
nu = (m - 1) * (1 + W /(B * (1 + 1/m)))^2
{p 4 4 2}
The (1 + 1/m) term in these expressions indicates that it is not necessary to
create a large number of imputed datasets, particularly when B is much smaller
than W. This condition will be satisfied unless there is a large amount of missing
data and the parameter estimates within each dataset are very precise.
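{p 4 4 2}
As a small worked illustration of the formulas above (the numbers are invented for
this sketch and do not come from any real dataset), suppose m = 3 imputations give
point estimates Q(1), Q(2), Q(3) = 0.50, 0.60, 0.70 with variances
U(1), U(2), U(3) = 0.04, 0.05, 0.06. Then
{p 8 12 2}
Q^ = (0.50 + 0.60 + 0.70)/3 = 0.60
{p 8 12 2}
W = (0.04 + 0.05 + 0.06)/3 = 0.05
{p 8 12 2}
B = ((-0.10)^2 + 0^2 + 0.10^2)/(3 - 1) = 0.01
{p 8 12 2}
T = 0.05 + 0.01 * (1 + 1/3) = 0.0633 approximately, giving a standard error sqrt(T) of about 0.252
{p 8 12 2}
nu = (3 - 1) * (1 + 0.05/(0.01 * (1 + 1/3)))^2 = 2 * 4.75^2 = 45.1 approximately
{p 4 4 2}
Here B is one-fifth of W, so the between-imputation component adds only modestly
to the total variance.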
{title:Examples}
{p 4 8 2}{cmd:. mvis7 y x1 x2 x3 using imp, m(10) genmiss(m_)}{p_end}
{p 4 8 2}{cmd:. use imp, clear}{p_end}
{p 4 8 2}{cmd:. micombine7 regress y x1 x2 x3}{p_end}
{p 4 8 2}{cmd:. stset time, failure(cens)}{p_end}
{p 4 8 2}{cmd:. micombine7 stcox x1 x2 x3, genxb(index)}{p_end}
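{p 4 4 2}
The following lines are a sketch of how some of the options described above might be
combined. The binary outcome {cmd:d} and the imputation identifier {cmd:imp} are
hypothetical variable names used only for illustration, and {cmd:Odds ratio} is
simply a label passed to {cmd:eform()}.

{p 4 8 2}{cmd:. micombine7 logit d x1 x2 x3, eform(Odds ratio)}{p_end}
{p 4 8 2}{cmd:. micombine7 regress y x1 x2 x3, impid(imp) detail lrr}{p_end}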
{title:Author}
{p 4 4 2}
Patrick Royston, MRC Clinical Trials Unit, London.
patrick.royston@ctu.mrc.ac.uk
{title:References}
{p 4 8 2}
Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003.
Tools for analyzing multiple imputed datasets. {it:Stata Journal} {cmd:3(3)}: 226-244.
{p 4 8 2}
Li, K., T. Raghunathan, and D. Rubin. 1991. Large sample significance levels from
multiply-imputed data using moment-based statistics and an F reference distribution.
{it:Journal of the American Statistical Association} {cmd:86}: 1065-1073.
{p 4 8 2}
Royston, P. 2004. Multiple imputation of missing values.
{it:Stata Journal} {cmd:4(3)}: 227-241.
{p 4 8 2}
Rubin, D. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley &
Sons.
{p 4 8 2}
Schafer, J. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
{p 4 8 2}
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of
missing blood pressure covariates in survival analysis.
{it:Statistics in Medicine} {cmd:18}: 681-694. (Also see http://www.multiple-imputation.com.)
{title:Also see}
{p 4 13 2}
Online: help for {help mvis7}.