{smcl}
{* *! version 1.1 21dec2016}{...}
{title:Title}
{pstd}
{manlink MI mi impute wmlogit} {hline 2} Weighted multiple imputation for categorical variables using multinomial logistic regression
{marker syntax}{...}
{title:Syntax}
{p 8 19 2}{cmd:mi} {cmdab:imp:ute} {cmd:wmlogit}
{it:ivar} [{it:{help indepvars}}]{cmd:,} {opt md(varname)}
[{it:{help mi_impute_wmlogit##options_table:options}}]
{synoptset 22 tabbed}{...}
{synopthdr:options}
{synoptline}
{syntab:Main}
{synopt: {it:impute_options}}any option of {helpb mi_impute##impopts:mi impute} except {cmd:noupdate} and {cmd:by()}.{p_end}
{p2coldent :* {opt md(varname)}}specify the variable containing the marginal distribution of {it:ivar}.{p_end}
{synopt: {opt marginal}}use marginal weights in weighted multiple imputation; default is conditional weights.{p_end}
{synoptline}
{p 4 6 2}
* {opt md(varname)} is required.{p_end}
{p 4 6 2}
The data must be {cmd:mi set} before using {cmd:mi impute wmlogit};
see {manhelp mi_set MI:mi set}.{p_end}
{p 4 6 2}
{it:ivar} must be {cmd:mi register} as imputed before using {cmd:mi impute wmlogit}; see {manhelp mi_set MI:mi set}.{p_end}
{p 4 6 2}
{it:indepvars} may not contain {help factor_variables:factor variables}; dummy variables for {it:indepvars} must be created prior to using {cmd:mi impute wmlogit}.
{marker menu}{...}
{title:Menu}
{phang}
{bf:Statistics > Multiple imputation}
{marker description}{...}
{title:Description}
{dlgtab:Population-calibrated inference for incomplete binary/categorical variables via weighted multiple imputation}
{pstd}
{cmd:mi} {cmd:impute} {cmd:wmlogit} fills in missing values of a categorical variable
{it:ivar} by using a weighted multinomial logistic regression imputation method, where weights are calculated
using the population-level marginal distribution of {it:ivar}.
{pstd}
For some variables in certain datasets, their corresponding marginal distributions in the population
can be obtained from external data sources (e.g., population surveys or censuses). If our study samples
in truth come from such a population, the population information can be fed into the imputation model
to calibrate inference to the population. In weighted multiple imputation, the population distribution
of {it:ivar} can be used to calculate probability weights, which are then used in multiple imputation
such that the post-imputation distribution matches the population level.
{dlgtab:Imputaion procedure}
{pstd}
In the imputation step, marginal/conditional weights calculated using the population marginal distribution of {it:ivar}
are attached to the complete cases, and a multinomial logistic regression model is fitted to the weighted complete
cases to obtain the maximum likelihood estimates of imputation model parameters {it:theta} and
their asymptotic sampling variance {it:U}. New parameters are then drawn from the large-sample
approximation N({it:theta}, {it:U}) to its posterior distribution assuming non-informative priors.
Finally, imputed values are simulated from the multinomial logistic regression using these newly drawn parameters.
{dlgtab:Derivation of marginal and conditional weights}
{pstd}
Let the dataset of size {it:N} contain {it:n_obs} subjects with observed {it:ivar} and {it:n_mis} subjects with
missing {it:ivar}. The first set of weights is termed {it:marginal weights}, which only depend on the population
distribution of {it:ivar}. For each category {it:j} of {it:ivar}, the proportion of this category required
in the imputed data is calculated as {it:p_imp = (p_pop*N - p_obs*n_obs)/n_mis}, where {it:p_pop} and {it:p_obs}
denote the proportions of category {it:j} in the population and observed in the dataset, respectively.
Marginal weight for subjects with observed values of {it:ivar} in category {it:j} is therefore given by
{it:w_m = 1/(p_obs/p_imp)}.
{pstd}
If there are covariates {it:indepvars} in the imputation model, their associations with {it:ivar} are not
reflected in marginal weights. Alternatively, weights can be calculated such that they account for the
effects of covariates in the imputation model. These weights are termed {it:conditional weights}, and can be derived
using the distribution of {it:ivar} obtained after estimating parameters of an imputation model
assuming MAR using the complete cases. Suppose the estimated proportion of category {it:j} of
{it:ivar} in the complete data after fitting a MAR imputation model to the complete cases is {it:p_j},
then the proportion of this category required in the imputed data is given by {it:p_imp = (p_pop*N - p_j*n_obs)/n_mis},
which implies that the conditional weight for this group is given by {it:w_c = 1/(p_j/p_imp)}.
{marker options}{...}
{title:Options}
{phang}
{it:impute_options} include {cmd:add()}, {cmd:replace}, {cmd:rseed()},
{cmd:double}, {cmd:dots}, {cmd:noisily}, {cmd:nolegend}, {cmd:force}; see
{manhelp mi_impute MI:mi impute} for details.
{phang}
{opt md(varname)} specifies the variable containing the marginal distribution of {it:ivar}, which is used to derive marginal/conditional weights in weighted multiple imputation.
{phang}
{opt marginal} specifies that weighted multiple imputation is performed using marginal weights; conditional weights are used by default when {opt marginal} is not stated.
{marker remarks}{...}
{title:Remarks}
{pstd}
{cmd:mi} {cmd:impute} {cmd:wmlogit} works by calling the {help uvis:uvis} (univariate imputation
sampling) command, which imputes missing values in the single variable {it: ivar} based
on multiple regression on {it: indepvars}.
{pstd}
{cmd:mi} {cmd:impute} {cmd:wmlogit} requires Stata version 14.1 and the {help uvis:ice} command from SSC.
{pstd}
Option {opt replace} currently works only with {help mi_styles:mi styles} {cmd:wide}; data must be {help mi_set:mi set} before using {cmd:mi impute wmlogit}.
{pstd}
{it:indepvars} may not contain {help factor_variables:factor variables}; dummy variables for {it:indepvars} must be created prior to using {cmd:mi impute wmlogit}.
{marker example}{...}
{title:Example}
{pstd}
Generate full data of 3-category X and continuous Y{p_end}
{phang2}
{cmd:. set seed 715209}
{p_end}
{phang2}
{cmd:. clear}
{p_end}
{phang2}
{cmd:. set obs 100000}
{p_end}
{phang2}
{cmd:. local beta0 = 0.5}
{p_end}
{phang2}
{cmd:. local betax2 = 1.5}
{p_end}
{phang2}
{cmd:. local betax3 = 0.75}
{p_end}
{phang2}
{cmd:. gen byte x = 1 if _n < 30001}
{p_end}
{phang2}
{cmd:. replace x = 2 if _n < 50001 & missing(x)}
{p_end}
{phang2}
{cmd:. replace x = 3 if missing(x)}
{p_end}
{phang2}
{cmd:. xi i.x}
{p_end}
{phang2}
{cmd:. gen y = rnormal(`beta0' + `betax2'*_Ix_2 + `betax3'*_Ix_3, 1)}
{p_end}
{pstd}
Fit the model of interest to full data{p_end}
{phang2}
{cmd:. reg y i.x}
{p_end}
{pstd}
Population marginal distribution of X is assumed to be the same as full-data distribution{p_end}
{phang2}
{cmd:. gen mdx = 0.3 if x == 1}
{p_end}
{phang2}
{cmd:. replace mdx = 0.2 if x == 2}
{p_end}
{phang2}
{cmd:. replace mdx = 0.5 if x == 3}
{p_end}
{pstd}
Generate missing data in X such that X is MNAR conditional on X and Y{p_end}
{phang2}
{cmd:. local phi0 = -2.5}
{p_end}
{phang2}
{cmd:. local phix2 = 0.45}
{p_end}
{phang2}
{cmd:. local phix3 = 0.25}
{p_end}
{phang2}
{cmd:. local phiy = 0.95}
{p_end}
{phang2}
{cmd:. gen byte x_mnar = x if runiform() >= invlogit(`phi0' + `phix2'*_Ix_2 + `phix3'*_Ix_3 + `phiy'*y)}
{p_end}
{pstd}
Fit the model of interest to the complete cases{p_end}
{phang2}
{cmd:. reg y i.x_mnar}
{p_end}
{pstd}
Declare data and register {cmd:x_mnar} as imputed{p_end}
{phang2}
{cmd:. mi set wide}
{p_end}
{phang2}
{cmd:. mi register imputed x_mnar}
{p_end}
{phang2}
{cmd:. mi register regular y}
{p_end}
{pstd}
Impute {cmd:x_mnar} using weighted multiple imputation with marginal weights
{p_end}
{phang2}
{cmd:. mi impute wmlogit x_mnar y, md(mdx) marginal add(10) noi dots}
{p_end}
{pstd}
Impute {cmd:x_mnar} using weighted multiple imputation with conditional weights
{p_end}
{phang2}
{cmd:. mi impute wmlogit x_mnar y, md(mdx) replace}
{p_end}
{pstd}
Fit the model of interest to each imputed dataset and combine results using Rubin's rules
{p_end}
{phang2}
{cmd:. mi estimate: reg y i.x_mnar}
{p_end}
{pstd}
Check if the post-imputation distribution of {cmd:x_mnar} matches the population level
{p_end}
{phang2}
{cmd:. mi estimate: proportion x_mnar}
{p_end}
{marker results}{...}
{title:Stored results}
{pstd}
{cmd:mi impute wmlogit} stores the following in {cmd:r()}:
{synoptset 25 tabbed}{...}
{p2col 5 25 29 2: Scalars}{p_end}
{synopt:{cmd:r(negw)}}1 if negative weight is detected; 0 otherwise{p_end}
{synopt:{cmd:r(M)}}total number of imputations{p_end}
{synopt:{cmd:r(M_add)}}number of added imputations{p_end}
{synopt:{cmd:r(M_update)}}number of updated imputations{p_end}
{synopt:{cmd:r(k_ivars)}}number of imputed variables{p_end}
{synopt:{cmd:r(N_g)}}number of imputed groups{p_end}
{synoptset 25 tabbed}{...}
{p2col 5 25 29 2: Macros}{p_end}
{synopt:{cmd:r(wtype)}}type of imputation weights used{p_end}
{synopt:{cmd:r(method)}}name of imputation method ({cmd:wmlogit}){p_end}
{synopt:{cmd:r(ivars)}}names of imputation variables{p_end}
{synopt:{cmd:r(rngstate)}}random-number state used{p_end}
{synoptset 25 tabbed}{...}
{p2col 5 25 29 2: Matrices}{p_end}
{synopt:{cmd:r(pwmi)}}imputation weights by categories of {it:ivar}{p_end}
{synopt:{cmd:r(N)}}number of observations in imputation sample{p_end}
{synopt:{cmd:r(N_complete)}}number of complete observations in imputation sample{p_end}
{synopt:{cmd:r(N_incomplete)}}number of incomplete observations in imputation sample{p_end}
{synopt:{cmd:r(N_imputed)}}number of imputed observations in imputation sample{p_end}
{title:Author}
{pstd}
Tra My Pham, University College London, London UK.{break}
tra.pham.09@ucl.ac.uk
{title:Acknowledgement}
{pstd}
I would like to thank Tim Morris for his suggestions to improve this command.
{title:References}
{phang}
Royston P. 2004. Multiple imputation of missing values. Stata Journal 4(3):227-241.
{title:Also see}
{helpb mi impute}
{helpb mi impute mlogit}
{helpb mi impute wlogit}
{helpb mi estimate}