-------------------------------------------------------------------------------
help for micombine7                                             Patrick Royston
-------------------------------------------------------------------------------

Estimation of regression models with multiply imputed samples

    micombine7 regression_cmd [yvar] covarlist [if exp] [in range] [weight]
        [, noconstant detail eform(string) genxb(newvarname) impid(varname)
        lrr regression_cmd_options]

where regression_cmd may be clogit, cnreg, glm, logistic, logit, mlogit, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee.

micombine7 shares only a small subset of the features of all estimation commands (see help estimates); see Remarks. All weight types supported by regression_cmd are allowed; see help weights.

Description

micombine7 estimates the parameters of a regression model whose type is determined by regression_cmd. Parameter estimates are combined across several replicates obtained previously by multiple imputation, e.g. by using mvis7 to create a file of imputed data. See Remarks for a brief account of how micombine7 combines the estimates and obtains standard errors.

Options

detail gives details of the regression model for each imputation.

eform(string) indicates that the exponentiated form of the coefficients is to be output and reporting of the constant is to be suppressed; string is used to label the exponentiated coefficients.

genxb(newvarname) creates newvarname to hold the linear predictor from each regression model, averaged over all the imputations.

impid(varname) specifies that varname is the variable identifying the imputations. The number of imputations is determined as the number of unique values of varname. Default varname: _j.

lrr specifies that the Li-Raghunathan-Rubin (LRR) robust estimate of the variance-covariance matrix of the regression coefficients be used.

noconstant suppresses the regression constant in all regressions.

regression_cmd_options may be any of the options appropriate to regression_cmd.
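
As an illustration of how these options might be combined, the sketch below reuses the file and variable names from the Examples section (imp, y, x1-x3, time, cens). The labelling string "Hazard ratio" is an arbitrary choice, and impid(_j) simply restates the documented default; neither is required.

```stata
* Sketch only: assumes imp.dta was created by mvis7 as in the Examples section
use imp, clear

* Linear regression, showing the fit for each imputation and using
* the LRR variance estimate; impid(_j) is the documented default
micombine7 regress y x1 x2 x3, detail lrr impid(_j)

* Cox model reported as exponentiated coefficients, with the averaged
* linear predictor saved in the new variable index
stset time, failure(cens)
micombine7 stcox x1 x2 x3, eform(Hazard ratio) genxb(index)
```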

Remarks

Details of statistical inference from multiple imputed datasets are nicely described in a recent Stata Journal article by John Carlin and colleagues (Carlin et al. 2003). Here, with due acknowledgment to John, I give an edited version of Section 2 of his article.

A simple method of combining estimates from several models was derived by Rubin (1987). Suppose initially that primary interest lies in estimating a scalar quantity, Q. Here, Q is a regression coefficient, for example, the log hazard ratio in a proportional hazards model. Suppose that we have imputed m complete datasets using an appropriate model. In each dataset, standard complete-data methods are used to obtain an estimate of Q with an associated standard error. Let Q(k) and U(k) denote the point estimate and its variance, respectively, from the kth (k = 1, 2, ..., m) dataset. The point estimate Q^ of Q from multiple imputation is simply the arithmetic mean of Q(1),...,Q(m).

Obtaining a valid standard error for the estimate Q^ requires combining information on within-imputation and between-imputation variation. The latter is important in reflecting uncertainty due to variability between imputation samples. First, a within-imputation variance component, W, is obtained as the mean of the complete-data variance estimates, U(1),...,U(m). Second, a between-imputation variance component, B, is calculated as the sum of squared deviations of Q(1),...,Q(m) about Q^, divided by m-1. The (total) variance T of Q^ is given by

T = W + B * (1 + 1/m)

Rubin (1987) showed that (Q - Q^)/sqrt(T) is distributed approximately as Student's t on nu degrees of freedom, where

nu = (m - 1) * (1 + W /(B * (1 + 1/m)))^2

The (1 + 1/m) term in these expressions indicates that it is not necessary to create a large number of imputed datasets, particularly when B is much smaller than W. The condition will be satisfied unless there is much missing data and the parameter estimates within each dataset are very precise.
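
The formulas above can be checked by hand with a few scalar commands. The sketch below uses made-up values for m = 3 imputations (the estimates q1-q3 and variances u1-u3 are hypothetical, not output from any real model) and is not how micombine7 itself is implemented.

```stata
* Hypothetical point estimates Q(k) and variances U(k), m = 3
scalar m  = 3
scalar q1 = 0.50
scalar q2 = 0.60
scalar q3 = 0.55
scalar u1 = 0.010
scalar u2 = 0.012
scalar u3 = 0.011

* Combined point estimate: mean of Q(1),...,Q(m)
scalar Qhat = (q1 + q2 + q3)/m

* Within-imputation variance: mean of U(1),...,U(m)
scalar W = (u1 + u2 + u3)/m

* Between-imputation variance: squared deviations about Qhat over m-1
scalar B = ((q1 - Qhat)^2 + (q2 - Qhat)^2 + (q3 - Qhat)^2)/(m - 1)

* Total variance and degrees of freedom
scalar T  = W + B*(1 + 1/m)
scalar nu = (m - 1)*(1 + W/(B*(1 + 1/m)))^2

display "Qhat = " Qhat "   SE = " sqrt(T) "   nu = " nu
```

With these values, Q^ = 0.55, W = 0.011, B = 0.0025, T is approximately 0.0143 (a standard error of about 0.12), and nu is approximately 37.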

Examples

    . mvis7 y x1 x2 x3 using imp, m(10) genmiss(m_)
    . use imp, clear
    . micombine7 regress y x1 x2 x3
    . stset time, failure(cens)
    . micombine7 stcox x1 x2 x3, genxb(index)

Author

Patrick Royston, MRC Clinical Trials Unit, London. patrick.royston@ctu.mrc.ac.uk

References

Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3): 226-244.

Li, K., T. Raghunathan, and D. Rubin. 1991. Large sample significance levels from multiply-imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association 86: 1065-1073.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.

Rubin, D. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Schafer, J. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681-694. (Also see http://www.multiple-imputation.com.)

Also see