{smcl}
{* 30sep2004}{...}
{hline}
help for {hi:micombine7}{right:Patrick Royston}
{hline}

{title:Estimation of regression models with multiply imputed samples}

{p 8 17 2}
{cmd:micombine7}
{it:regression_cmd}
[{it:yvar}]
{it:covarlist}
[{cmd:if} {it:exp}]
[{cmd:in} {it:range}]
[{it:weight}]
[
{cmd:,}
{cmdab:nocons:tant}
{cmdab:det:ail}
{cmdab:e:form(}{it:string}{cmd:)}
{cmdab:g:enxb(}{it:newvarname}{cmd:)}
{cmdab:imp:id(}{it:varname}{cmd:)}
{cmd:lrr}
{it:regression_cmd_options}
]

{p 4 4 2}
where

{p 8 8 2}
{it:regression_cmd} may be
{help clogit},
{help cnreg},
{help glm},
{help logistic},
{help logit},
{help mlogit},
{help ologit},
{help oprobit},
{help poisson},
{help probit},
{help qreg},
{help regress},
{help rreg},
{help stcox},
{help streg},
or
{help xtgee}.

{p 4 4 2}
{cmd:micombine7} shares only a small subset of the features of all estimation
commands (see help {help estimates}); see {it:Remarks}.

{p 4 4 2}
All weight types supported by {it:regression_cmd} are allowed; see help
{help weights}.


{title:Description}

{p 4 4 2}
{cmd:micombine7} estimates the parameters of a regression model whose type is
determined by {it:regression_cmd}. Parameter estimates are combined across
several replicates obtained previously by multiple imputation, e.g. by using
{help mvis7} to create a file of imputed data. See {it:Remarks} for a brief
account of how {cmd:micombine7} combines the estimates and obtains standard
errors.


{title:Options}

{p 4 8 2}
{cmd:detail} gives details of the regression model for each imputation.

{p 4 8 2}
{cmd:eform(}{it:string}{cmd:)} indicates that the exponentiated form of the
coefficients is to be output and reporting of the constant is to be
suppressed; {it:string} is used to label the exponentiated coefficients.

{p 4 8 2}
{cmd:genxb(}{it:newvarname}{cmd:)} creates {it:newvarname} to hold the linear
predictor from each regression model, averaged over all the imputations.
{p 4 8 2}
{cmd:impid(}{it:varname}{cmd:)} specifies that {it:varname} is the variable
identifying the imputations. The number of imputations is determined as the
number of unique values of {it:varname}. The default {it:varname} is
{cmd:_j}.

{p 4 8 2}
{cmd:lrr} specifies that the Li-Raghunathan-Rubin (LRR) robust estimate of the
variance-covariance matrix of the regression coefficients be used
(Li, Raghunathan, and Rubin 1991).

{p 4 8 2}
{cmd:noconstant} suppresses the regression constant in all regressions.

{p 4 8 2}
{it:regression_cmd_options} may be any of the options appropriate to
{it:regression_cmd}.


{title:Remarks}

{p 4 4 2}
Details of statistical inference from multiply imputed datasets are nicely
described in a recent Stata Journal article by John Carlin and colleagues
(Carlin et al. 2003). Here, with due acknowledgment to John, I give an edited
version of Section 2 of his article.

{p 4 4 2}
A simple method of combining estimates from several models was derived by
Rubin (1987). Suppose initially that primary interest lies in estimating a
scalar quantity, Q. Here, Q is a regression coefficient, for example, the log
hazard ratio in a proportional hazards model. Suppose that we have imputed m
complete datasets using an appropriate model. In each dataset, standard
complete-data methods are used to obtain an estimate of Q with an associated
standard error. Let Q(k) and U(k) denote the point estimate and its variance,
respectively, from the kth (k = 1, 2, ..., m) dataset. The point estimate Q^
of Q from multiple imputation is simply the arithmetic mean of
Q(1),...,Q(m).

{p 4 4 2}
Obtaining a valid standard error for the estimate Q^ requires combining
information on within-imputation and between-imputation variation. The latter
is important in reflecting uncertainty due to variability between imputation
samples. First, a within-imputation variance component, W, is obtained as the
mean of the complete-data variance estimates, U(1),...,U(m).
Second, a between-imputation variance component, B, is calculated as the sum
of squares of Q(1),...,Q(m) about Q^, divided by m-1. The (total) variance T
of Q^ is given by

{p 8 12 2}
T = W + B * (1 + 1/m)

{p 4 4 2}
Rubin (1987) showed that (Q - Q^)/sqrt(T) is distributed approximately as
Student's t on nu degrees of freedom, where

{p 8 12 2}
nu = (m - 1) * (1 + W /(B * (1 + 1/m)))^2

{p 4 4 2}
The (1 + 1/m) term in these expressions indicates that it is not necessary to
create a large number of imputed datasets, particularly when B is much
smaller than W. This condition holds unless there is much missing data and
the parameter estimates within each dataset are very precise.


{title:Examples}

{p 4 8 2}{cmd:. mvis7 y x1 x2 x3 using imp, m(10) genmiss(m_)}{p_end}
{p 4 8 2}{cmd:. use imp, clear}{p_end}
{p 4 8 2}{cmd:. micombine7 regress y x1 x2 x3}{p_end}
{p 4 8 2}{cmd:. stset time, failure(cens)}{p_end}
{p 4 8 2}{cmd:. micombine7 stcox x1 x2 x3, genxb(index)}{p_end}


{title:Author}

{p 4 4 2}
Patrick Royston, MRC Clinical Trials Unit, London.
patrick.royston@ctu.mrc.ac.uk


{title:References}

{p 4 8 2}
Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003.
Tools for analyzing multiple imputed datasets.
{it:Stata Journal} {cmd:3(3)}: 226-244.

{p 4 8 2}
Li, K., T. Raghunathan, and D. Rubin. 1991.
Large sample significance levels from multiply-imputed data using
moment-based statistics and an F reference distribution.
{it:Journal of the American Statistical Association} {cmd:86}: 1065-1073.

{p 4 8 2}
Royston, P. 2004. Multiple imputation of missing values.
{it:Stata Journal} {cmd:4(3)}: 227-241.

{p 4 8 2}
Rubin, D. 1987. Multiple Imputation for Nonresponse in Surveys.
New York: John Wiley & Sons.

{p 4 8 2}
Schafer, J. 1997. Analysis of Incomplete Multivariate Data.
London: Chapman & Hall.

{p 4 8 2}
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999.
Multiple imputation of missing blood pressure covariates in survival analysis.
{it:Statistics in Medicine} {cmd:18}: 681-694.
(Also see http://www.multiple-imputation.com.)


{title:Also see}

{p 4 13 2}
Online: help for {help mvis7}.
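

{title:Worked sketch of the combination rules}

{p 4 4 2}
The combination rules described in {it:Remarks} can be traced by hand for a
single coefficient. The sketch below is illustrative only and is not taken
from the {cmd:micombine7} source: the three estimates {cmd:q1}-{cmd:q3} and
variances {cmd:u1}-{cmd:u3} are made-up values standing in for Q(k) and U(k)
from m = 3 imputed datasets, and all scalar names are hypothetical.

```stata
* Hypothetical point estimates Q(k) and variances U(k) from m = 3 imputations
scalar m  = 3
scalar q1 = 0.50
scalar q2 = 0.60
scalar q3 = 0.55
scalar u1 = 0.010
scalar u2 = 0.012
scalar u3 = 0.011

* Combined point estimate: arithmetic mean of the Q(k)
scalar Qhat = (q1 + q2 + q3)/m

* Within-imputation variance: mean of the U(k)
scalar W = (u1 + u2 + u3)/m

* Between-imputation variance: sum of squares about Qhat, divided by m-1
scalar B = ((q1 - Qhat)^2 + (q2 - Qhat)^2 + (q3 - Qhat)^2)/(m - 1)

* Total variance and degrees of freedom for the t reference distribution
scalar T  = W + B*(1 + 1/m)
scalar nu = (m - 1)*(1 + W/(B*(1 + 1/m)))^2

display "estimate = " Qhat ", std. err. = " sqrt(T) ", df = " nu
```

{p 4 4 2}
Because B is small relative to W in these made-up numbers, nu comes out much
larger than m - 1, which is the situation in which a modest number of
imputations is adequate.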