-------------------------------------------------------------------------------
help for micombine                                              Patrick Royston
-------------------------------------------------------------------------------

Estimation of regression models with multiply imputed samples

    micombine {supported_regression_cmd|other_regression_cmd} [yvar] [covarlist]
        [other_stuff] [if exp] [in range] [weight] [, br noconstant detail
        {eform|eform(string)} genxb(newvarname) impid(varname) infgain lrr
        nowarning obsid(varname) svy[(svy_options)] regression_cmd_options]

where

supported_regression_cmds are clogit, cnreg, glm, logistic, logit, mlogit, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee, and other_regression_cmd is any other Stata regression command (see Remarks).

micombine shares a subset of the features of all estimation commands (see help estimates); see Remarks.

All weight types supported by regression_cmd are allowed; see help weights.

Description

micombine estimates the parameters of a regression model whose type is determined by supported_regression_cmd or other_regression_cmd. Parameter estimates are combined across several replicates obtained previously by multiple imputation, e.g. by using ice to create a file of imputed data. See Remarks for a brief account of how micombine combines the estimates and obtains standard errors.

Options

br calculates degrees of freedom and tests of significance for each predictor according to formulae (3)-(5) of Barnard & Rubin (1999). After estimation, the required degrees of freedom are stored in a matrix (column vector) e(nutilde). Note that if test is used after micombine for significance testing of regression coefficients, such tests assume that the degrees of freedom are equal to the number of observations minus the number of parameters estimated, not those given in e(nutilde).
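As a minimal sketch of the br option, assuming an imputed file is in memory and that y, x1, x2 and x3 are illustrative variable names, the stored degrees of freedom could be inspected like this:

```stata
* Fit the model with Barnard-Rubin degrees of freedom
* (y, x1, x2, x3 are illustrative names, not part of micombine)
micombine regress y x1 x2 x3, br

* List the column vector of degrees of freedom stored by the br option
matrix list e(nutilde)
```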

detail gives details of the regression model for each imputation.

eform(string) indicates that the exponentiated form of the coefficients is to be output and reporting of the constant is to be suppressed; string is used to label the exponentiated coefficients.

eform indicates that the exponentiated form of the coefficients is to be output and reporting of the constant is to be suppressed; the exponentiated coefficients are labelled exp(b).
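The two forms of the eform option might be used as follows. This is only a sketch: the variable names and the label OR are illustrative assumptions.

```stata
* Exponentiated coefficients with the default label exp(b)
micombine poisson y x1 x2 x3, eform

* Exponentiated coefficients labelled OR (an illustrative label)
micombine logit y x1 x2 x3, eform(OR)
```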

genxb(newvarname) creates newvarname to hold the linear predictor from each regression model, averaged over all the imputations.

impid(varname) specifies that varname is the variable identifying the imputations. The number of imputations is determined as the number of unique values of varname. All observations for which varname takes the value zero are ignored in the analysis. Default varname: _mj.

infgain reports the percentage increase in information and sample size due to the use of multiple imputation. The information gain is the percentage increase in the Wald chi-square for the entire model, comparing the Wald chi-square for the model on the original data (complete case analysis) with that using the variance-covariance matrix of the parameters estimated using Rubin's rules. With a bad imputation model the information increase could be negative.

lrr specifies that the Li-Raghunathan-Rubin (LRR) robust estimate of the variance-covariance matrix of the regression coefficients be used.

noconstant suppresses the regression constant in all regressions.

nowarning suppresses the warning message about the use of other_regression_cmds (see Remarks).

obsid(varname) is provided to allow micombine to analyse datasets created by programs other than ice. varname specifies the name of a variable holding the "observation ID", i.e. the sequence number of each observation in a given imputation. The number of observations should be identical between imputations, as should the order of the observations. varname should run 1,...,N for imputation 1, 1,...,N for imputation 2, and so on. ice automatically stores the information with the data, so this option is not required for files created by ice. Default varname: _mi.

[Stata 9] svy[(svy_options)] performs survey regression. The prefix svy: is placed before regression_cmd. If svy_options is supplied, then svy_options is placed after svy and before the colon. The data must be svyset before this option is used. This must be done before ice is used to impute missing values. The fact that the data have been svyset is inherited by the file of imputations created by ice.

[Stata 8] svy performs survey regression. The prefix svy is placed before regression_cmd, so that, for example, micombine regress ..., svy is interpreted as micombine svyregress .... Options for survey regression are included as options to micombine. The data must be svyset before the svy option is used. This must be done before ice is used to impute missing values. The fact that the data have been svyset is inherited by the file of imputations created by ice.

regression_cmd_options may be any of the options appropriate to regression_cmd.
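For imputations created by a program other than ice, the impid() and obsid() options described above might be combined as in the sketch below. The file name myimps and the variables imp and obs are hypothetical. The sketch assumes a stacked file in which imp numbers the imputations 1, 2, ... and the observations appear in the same order within each imputation.

```stata
* Hypothetical stacked file of imputations not created by ice
use myimps, clear

* Keep the existing row order within each imputation, then number the rows
sort imp, stable
by imp: generate long obs = _n

* Tell micombine which variables identify imputations and observations
micombine regress y x1 x2 x3, impid(imp) obsid(obs)
```

The stable option of sort is used so that the original order of observations within each imputation is preserved when the sequence numbers are generated.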

Remarks

Details of statistical inference from multiply imputed datasets are nicely described in a recent Stata Journal article by John Carlin and colleagues (Carlin et al., 2003). Here, with due acknowledgment to John, I give an edited version of Section 2 of his article.

A simple method of combining estimates from several models was derived by Rubin (1987). Suppose initially that primary interest lies in estimating a scalar quantity, Q. Here, Q is a regression coefficient, for example, the log hazard ratio in a proportional hazards model. Suppose that we have imputed m complete datasets using an appropriate model. In each dataset, standard complete-data methods are used to obtain an estimate of Q with an associated standard error. Let Q_j and V_j denote the point estimate and variance respectively from the jth (j = 1, 2, ... , m) dataset. The point estimate Q^ of Q from multiple imputation is simply the arithmetic mean of Q_1,...,Q_m.

Obtaining a valid standard error for the estimate Q^ requires combining information on within-imputation and between-imputation variation. The latter is important in reflecting uncertainty due to variability between imputation samples. First, a within-imputation variance component, W, is obtained as the mean over the m imputations of the complete-data variances, V_1,...,V_m. Second, a between-imputation variance component, B, is calculated as the sum of squares of Q_1,...,Q_m about Q^, divided by m-1. In summary,

Q^ = (Q_1 + ... + Q_m)/m

W = (V_1 + ... + V_m)/m

B = ((Q_1 - Q^)^2 + ... + (Q_m - Q^)^2)/(m - 1)

The (total) variance T of Q^ is given by

T = W + B * (1 + 1/m)

Rubin (1987) showed that (Q - Q^)/sqrt(T) is distributed approximately as Student's t on nu degrees of freedom, where

nu = (m - 1) * (1 + W /(B * (1 + 1/m)))^2

The (1 + 1/m) term in these expressions indicates that it is not necessary to create a large number of imputed datasets, particularly when B is much smaller than W. The condition will be satisfied unless there is much missing data and the parameter estimates within each dataset are very precise.
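The formulae above can be checked by hand with Stata scalars. The following is a small worked sketch with made-up numbers for m = 3 imputations. It is not output from micombine, and it assumes that no variables in memory share the scalar names.

```stata
* Hypothetical point estimates and variances from m = 3 imputed datasets
scalar m  = 3
scalar Q1 = 1.0
scalar Q2 = 1.2
scalar Q3 = 1.4
scalar V1 = 0.04
scalar V2 = 0.05
scalar V3 = 0.06

* Combined point estimate and within-imputation variance
scalar Qhat = (Q1 + Q2 + Q3)/m
scalar W    = (V1 + V2 + V3)/m

* Between-imputation variance
scalar B = ((Q1 - Qhat)^2 + (Q2 - Qhat)^2 + (Q3 - Qhat)^2)/(m - 1)

* Total variance and degrees of freedom
scalar T  = W + B*(1 + 1/m)
scalar nu = (m - 1)*(1 + W/(B*(1 + 1/m)))^2

display "Qhat = " Qhat ", SE = " sqrt(T) ", nu = " nu
```

With these inputs, Q^ = 1.2, W = 0.05, B = 0.04, T is about 0.1033 (SE about 0.3215) and nu is about 7.51.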

Available regression commands

micombine has been tested with the commands listed under supported_regression_cmd at the beginning of this help file. micombine may work satisfactorily with other_regression_cmds, but this cannot be guaranteed. This facility is provided so that the researcher familiar with a particular Stata command has a fighting chance of obtaining correct MI estimates and standard errors. HOWEVER, THE AUTHOR DISCLAIMS RESPONSIBILITY FOR THE CORRECTNESS OF RESULTS ARISING FROM USE OF AN other_regression_cmd.

Note that other_stuff in the syntax diagram is code that may be required by some other_regression_cmds; for example, ivreg wants (varlist2 = varlist_iv). micombine parses for the occurrence of an opening parenthesis. There may be other syntaxes that are not accommodated by this approach; if so, please contact the author with details.
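As an illustration of other_stuff, a call using ivreg might look like the sketch below. The variable names are hypothetical (x2 is taken to be the endogenous covariate, z1 and z2 its instruments). Because ivreg is not among the supported commands, the caveats above apply to the results.

```stata
* ivreg is an other_regression_cmd: (x2 = z1 z2) is passed as other_stuff
* nowarning suppresses the warning about using an untested command
micombine ivreg y x1 (x2 = z1 z2), nowarning
```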

Post-estimation prediction

The predict command may work as you expect after micombine, but this feature should be treated with caution. micombine stores the quantities needed by predict at the last execution of the regression command, that is, at the final imputation, but prediction following some regression commands has non-standard features that are hard to emulate accurately. Known issues are as follows:

1. After micombine mlogit: predict may require that the outcome levels are known as 0, 1, 2, ..., so it may be necessary to drop the value label for the outcome variable, if such a label is defined. This is KNOWN to be a problem using mfx following micombine mlogit. For example, mfx compute, predict(outcome(0)) will work only if the lowest level of the outcome is 0 and is not labelled.
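One way to work around the labelling issue is sketched below. It assumes an outcome variable named y whose lowest level is 0, and the covariate names are illustrative. label values with no label name detaches any value label from the variable.

```stata
* Detach the value label from the outcome so that its levels are seen as 0, 1, 2, ...
label values y

* Fit the model, then request marginal effects for the lowest outcome level
micombine mlogit y x1 x2 x3
mfx compute, predict(outcome(0))
```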

Sample size

The sample size reported by micombine is the number of observations found when fitting the model in the first imputation (i.e. by default, for _mj==1). It may happen that the sample size varies between imputations, for example, when the effect of an if or in filter differs between imputations, or when a weighting scheme effectively removes different observations in different imputations. The resulting parameter estimates and their SEs are believed still to be approximately valid in this situation. The program alerts you to the occurrence of variable sample size, but no action need be taken by you. Post-estimation commands test and testparm use the sample size found when the model is fitted in the final imputed dataset.

Examples

    . ice y x1 x2 x3 using imp, m(10) genmiss(m_)
    . use imp, clear
    . micombine regress y x1 x2 x3
    . stset time, failure(cens)
    . micombine stcox x1 x2 x3, genxb(index)
    . test x2==1
    . testparm x1 x2
    . micombine regress y x1 x2 x3, svy(subpop(if sex==1))

Author

Patrick Royston, MRC Clinical Trials Unit, London. pr@ctu.mrc.ac.uk

References

Barnard, J. and D. B. Rubin. 1999. Small-sample degrees of freedom with multiple imputation. Biometrika 86: 948-955.

Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3(3): 226-244.

Li, K., T. Raghunathan, and D. Rubin. 1991. Large sample significance levels from multiply-imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association 86: 1065-1073.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.

Royston, P. 2005. Multiple imputation of missing values: update. Stata Journal 5(2): 188-201.

Royston, P. 2006. Multiple imputation of missing values: update of ice. Stata Journal, in preparation.

Rubin, D. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Schafer, J. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

van Buuren, S., H. C. Boshuizen and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681-694. (Also see http://www.multiple-imputation.com.)

Also see