Estimation of regression models with multiply imputed samples
micombine7 regression_cmd [yvar] covarlist [if exp] [in range] [weight] [ , noconstant detail eform(string) genxb(newvarname) impid(varname) lrr regression_cmd_options ]
where
regression_cmd may be clogit, cnreg, glm, logistic, logit, mlogit, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee.
micombine7 shares only a small subset of the features of all estimation commands (see help estimates); see Remarks.
All weight types supported by regression_cmd are allowed; see help weights.
Description
micombine7 estimates the parameters of a regression model whose type is determined by regression_cmd. Parameter estimates are combined across several replicates obtained previously by multiple imputation, e.g. by using mvis7 to create a file of imputed data. See Remarks for a brief account of how micombine7 combines the estimates and obtains standard errors.
Options
detail gives details of the regression model for each imputation.
eform(string) indicates that the exponentiated form of the coefficients is to be output and reporting of the constant is to be suppressed; string is used to label the exponentiated coefficients.
genxb(newvarname) creates newvarname to hold the linear predictor from each regression model, averaged over all the imputations.
impid(varname) specifies that varname is the variable identifying the imputations. The number of imputations is determined as the number of unique values of varname. Default varname: _j.
lrr specifies that the Li-Raghunathan-Rubin (LRR) robust estimate of the variance-covariance matrix of the regression coefficients be used.
noconstant suppresses the regression constant in all regressions.
regression_cmd_options may be any of the options appropriate to regression_cmd.
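As an illustration of what genxb() describes, the following Python sketch averages a linear predictor over imputations. It assumes that the linear predictor is computed within each imputation from that imputation's own coefficients and covariate values and then averaged per observation; the help text does not spell out the internals, so this is a reading of the description rather than the command's code. The function names and the data are hypothetical.

```python
from statistics import mean

def linear_predictor(row, coefs, constant=0.0):
    """Return constant + x'b for one observation."""
    return constant + sum(x * b for x, b in zip(row, coefs))

def averaged_xb(imputed_data, imputed_coefs, imputed_consts):
    """Average the linear predictor per observation over all imputations.

    imputed_data[k][i] is the covariate row of observation i in imputation k;
    imputed_coefs[k] and imputed_consts[k] are the estimates from imputation k.
    """
    n = len(imputed_data[0])
    per_imputation = [
        [linear_predictor(row, coefs, const) for row in data]
        for data, coefs, const in zip(imputed_data, imputed_coefs, imputed_consts)
    ]
    return [mean(xb[i] for xb in per_imputation) for i in range(n)]

# Hypothetical example: 2 imputations, 2 observations, 2 covariates.
# The second covariate of the first observation was missing and imputed.
imputed_data = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[1.0, 4.0], [0.0, 1.0]],
]
imputed_coefs = [[0.5, 0.25], [0.7, 0.25]]
imputed_consts = [1.0, 1.0]

xb = averaged_xb(imputed_data, imputed_coefs, imputed_consts)
print(xb)
```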
Remarks
Details of statistical inference from multiply imputed datasets are nicely described in a recent Stata Journal article by John Carlin and colleagues (Carlin et al., 2003). Here, with due acknowledgment to John, I give an edited version of Section 2 of his article.
A simple method of combining estimates from several models was derived by Rubin (1987). Suppose initially that primary interest lies in estimating a scalar quantity, Q. Here, Q is a regression coefficient, for example, the log hazard ratio in a proportional hazards model. Suppose that we have imputed m complete datasets using an appropriate model. In each dataset, standard complete-data methods are used to obtain an estimate of Q with an associated standard error. Let Q(k) and U(k) denote the point estimate and its variance, respectively, from the kth (k = 1, 2, ... , m) dataset. The point estimate Q^ of Q from multiple imputation is simply the arithmetic mean of Q(1),...,Q(m).
Obtaining a valid standard error for the estimate Q^ requires combining information on within-imputation and between-imputation variation. The latter is important in reflecting uncertainty due to variability between imputation samples. First, a within-imputation variance component, W, is obtained as the mean of the complete-data variance estimates, U(1),...,U(m). Second, a between-imputation variance component, B, is calculated as the sum of squares of Q(1),...,Q(m) about Q^, divided by m-1. The (total) variance T of Q^ is given by
T = W + B * (1 + 1/m)
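The combination rule can be sketched in a few lines of Python. This is an illustration of Rubin's formulas for a scalar Q, not the code used by micombine7; the function name pool_rubin and the numbers are hypothetical. Here W is the mean of the variances U(1),...,U(m) and B is the sample variance of the estimates Q(1),...,Q(m).

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates Q(k) and their variances U(k) for a scalar Q."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    q_hat = mean(estimates)   # Q^: arithmetic mean of Q(1),...,Q(m)
    w = mean(variances)       # W: mean of U(1),...,U(m)
    b = variance(estimates)   # B: sum of squares about Q^, divided by m - 1
    t = w + b * (1 + 1 / m)   # T = W + B * (1 + 1/m)
    return q_hat, w, b, t

# Hypothetical log hazard ratios and their variances from m = 5 imputations
q = [0.50, 0.54, 0.47, 0.52, 0.47]
u = [0.040, 0.042, 0.038, 0.041, 0.039]

q_hat, w, b, t = pool_rubin(q, u)
print(q_hat, w, b, t)
print("standard error:", t ** 0.5)
```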
Rubin (1987) showed that (Q - Q^)/sqrt(T) is distributed approximately as Student's t on nu degrees of freedom, where
nu = (m - 1) * (1 + W /(B * (1 + 1/m)))^2
The (1 + 1/m) term in these expressions indicates that it is not necessary to create a large number of imputed datasets, particularly when B is much smaller than W. This condition will be satisfied unless there is much missing data and the parameter estimates within each dataset are very precise.
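To see how weakly T depends on m when B is much smaller than W, the Python sketch below evaluates T and nu for several values of m. The values of W and B are hypothetical, chosen only so that B is small relative to W; the function names are illustrative.

```python
def total_variance(w, b, m):
    """T = W + B * (1 + 1/m)."""
    return w + b * (1 + 1 / m)

def degrees_of_freedom(w, b, m):
    """nu = (m - 1) * (1 + W / (B * (1 + 1/m)))^2."""
    if b == 0:
        return float("inf")  # no between-imputation variation
    return (m - 1) * (1 + w / (b * (1 + 1 / m))) ** 2

# Hypothetical variance components with B much smaller than W
w, b = 0.040, 0.001

for m in (3, 5, 10, 100):
    t = total_variance(w, b, m)
    nu = degrees_of_freedom(w, b, m)
    print(f"m = {m:3d}   T = {t:.5f}   nu = {nu:.1f}")
```

With these inputs, moving from m = 3 to m = 100 changes T by less than one percent, which is the sense in which a small number of imputations suffices.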
Examples
. mvis7 y x1 x2 x3 using imp, m(10) genmiss(m_)
. use imp, clear
. micombine7 regress y x1 x2 x3
. stset time, failure(cens)
. micombine7 stcox x1 x2 x3, genxb(index)
Author
Patrick Royston, MRC Clinical Trials Unit, London. patrick.royston@ctu.mrc.ac.uk
References
Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3): 226-244.
Li, K., T. Raghunathan, and D. Rubin. 1991. Large sample significance levels from multiply-imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association 86: 1065-1073.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.
Rubin, D. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Schafer, J. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681-694. (Also see http://www.multiple-imputation.com.)
Also see