help mi mvncat -------------------------------------------------------------------------------

Title

mi mvncat -- Assign "final" values to (mvn) imputed categorical variables

Syntax

mi mvncat dset [(reference)] [\ dset [(reference)] ... ] [, options]

where dset is a set of dummy variables, representing a variable with more than two categories

reference is the reference dummy. Make sure to enclose reference in parentheses. See Should the reference dummy be specified?

options     Description
-------------------------------------------------------------------------
report      display dset and reference category
noupdate    do not update MI data; see mi update
-------------------------------------------------------------------------

Description

mi mvncat assigns "final" values to multiply imputed categorical variables, using the procedure described by Allison (2002:40). A categorical variable with k levels is expected to be represented by k - 1 dummies in the dataset. After these dummies have been (multiply) imputed using multivariate normal regression (mi impute mvn), mi mvncat assigns the value 0 or 1 to each dummy, ensuring that the dummies representing one categorical variable (including its reference category) add up to 1.

Options

report displays dsets and corresponding reference categories. If the reference dummy is not registered imputed, it is reported as "no reference category".

noupdate suppresses mi update.

Remarks

When to use mi mvncat?

Suppose a dataset contains different types of variables with an arbitrary pattern of missing values. In this case, multivariate normal regression may be used to (multiply) impute the missing values. Although this method was originally designed for continuous (normally distributed) variables, Allison (2002:38-40) describes how multivariate normal regression may be used to impute dummies or categorical variables.

Steps previous to mi mvncat

1. Create k - 1 dummies for each categorical variable with k levels in the original dataset (m = 0). You may also create all k dummies, but impute only k - 1 of them later. Make sure the dummies have hard missings (.a, ..., .z) wherever the categorical variable they represent has hard missings.

2. mi set your dataset wide.

3. mi register imputed the k - 1 dummies for each categorical variable with k levels. You may register all k dummies imputed.

4. mi impute mvn values for k - 1 dummies for each categorical variable with k levels.

What does mi mvncat do?

For a categorical variable with 3 levels (thus 2 dummies), Allison (2002:40) suggests the following:

1. calculate a reference category as 1 - imputed_dummy1 - imputed_dummy2

2. assign value 1 to whichever category has the largest (imputed) value. If the reference category happens to be coded 1, assign value 0 to both dummies.

mi mvncat follows this approach, assigning values 0 and 1 to all dummies that are a) registered imputed and b) specified in dset.
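The two steps above can be sketched by hand on toy data. This is only an illustration of Allison's rule, not the code mi mvncat itself runs. The variable names (d1, d2, ref, largest, final1, final2, finalref) and the values are hypothetical, and ties are not handled.

```stata
* Toy data: d1 and d2 stand in for two imputed dummies (hypothetical values)
clear
input d1 d2
1.26249 .113094
.061359 .017226
.364564 .048897
.2      .7
end

* Step 1: implied value of the reference category
generate double ref = 1 - d1 - d2

* Step 2: the category with the largest value gets 1, all others get 0
* (in this sketch a tie would assign 1 to more than one category)
generate double largest = max(d1, d2, ref)
generate byte final1   = (d1  == largest)
generate byte final2   = (d2  == largest)
generate byte finalref = (ref == largest)

list d1 d2 ref final1 final2 finalref
```

Where the reference category has the largest value, both dummies end up 0, as described in step 2.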

Should the reference dummy be specified?

Given the reference dummy has been created, it is never wrong to ...

(a) specify the reference dummy in dset

(b) specify the reference dummy in (reference)

..., but it is not always necessary.

There are two possible scenarios after the imputation step.

In the first scenario, all (soft) missing values in the k-1 dummies in m = 0 are imputed in m > 0. In this case, mi mvncat determines the reference automatically. You do not have to specify (reference). If the kth dummy has been created, is registered imputed and specified in dset, mi mvncat will assign "final" values to all k dummies in m > 0. If the kth dummy has not been created, or is not registered imputed, or is not specified in dset, only the k-1 dummies are assigned "final" values.

In the second scenario, there are soft missing values in more than one imputed dummy in m > 0. In this case, mi mvncat cannot determine the reference category and will exit with an error. You will have to specify a reference dummy in (reference). Dummies are assigned "final" values if they have been created and are registered imputed. Note that two or more imputed dummies may have soft missing values in m > 0 even though no reference dummy has been created. In this case, you have to specify (reference) and choose an arbitrary name that is not the name of an existing variable.
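To illustrate the syntax, the calls below sketch both ways of naming a reference. They reuse variable names from the example in this help file (race1-race3 with race2 as reference, ind1-ind12 with ind4 as reference). indref is a hypothetical name, assumed not to exist in the dataset, and the second call assumes a situation in which no reference dummy has been created. Whether a given call succeeds depends on how the dummies were created, registered and imputed.

```stata
* Reference dummy exists: list it in dset and name it in parentheses
mi mvncat race1 race2 race3 (race2) \ ind1-ind12 (ind4) , report

* No reference dummy was created: pass an unused (hypothetical) name
mi mvncat ind1-ind3 ind5-ind12 (indref)
```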

Example

In this example I do not want to show how to properly impute missing values. The point is to illustrate how mi mvncat works.

. sysuse nlsw88 ,clear
(NLSW, 1988 extract)

Create some missing values in race and industry.

. replace race = . in 1/150
(150 real changes made, 150 to missing)

. replace industry = . in 100/300
(201 real changes made, 201 to missing)

Create dummies.

. tabulate race ,generate(race) nofreq

. tabulate industry ,generate(ind) nofreq

Remember to copy hard missings from race and industry (if there are any) when creating the dummies. One way to do this is to use the chm prefix (if installed) with tabulate.
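Without chm, one possible way to copy hard missings is sketched below. The sketch is self-contained and separate from the running example: it reloads nlsw88 and, purely for illustration, sets race to .a in the first ten observations.

```stata
sysuse nlsw88 ,clear

* Hypothetical: treat the first ten observations as hard missing
replace race = .a in 1/10

* tabulate leaves the dummies at soft missing (.) where race is missing
tabulate race ,generate(race) nofreq

* Copy extended missing codes (.a, ..., .z sort above .) into each dummy
foreach v of varlist race1 race2 race3 {
    replace `v' = race if race > .
}
```

Observations that are soft missing (.) in race are left untouched, so they remain eligible for imputation.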

Declare data to be MI data. Note that mi mvncat requires the style to be "wide".

. mi set wide

Register variables to be imputed. Here the reference category for industry is not registered and will therefore not exist in imputed datasets (m > 0).

. mi register imputed race1 race2 race3 ind1-ind3 ind5-ind12

Impute values using the mvn method (see mi impute). Choose race2 and ind4 as reference categories.

. mi impute mvn ///
        race1 race3 ind1-ind3 ind5-ind12 = age married grade wage ,add(5)

[output omitted]

. list _1_race1 _1_race2 _1_race3 _5_ind1 _5_ind2 _5_ind12 in 96/105

     +------------------------------------------------------------------+
     |  _1_race1   _1_race2   _1_race3    _5_ind1    _5_ind2   _5_ind12 |
     |------------------------------------------------------------------|
 96. |   1.26249          .    .113094          0          0          0 |
 97. |   .061359          .    .017226          0          0          0 |
 98. |   1.23018          .   -.110508          0          0          0 |
 99. |   .500781          .    .057241          0          0          0 |
100. |   .364564          .    .048897   -.086906   -.013051    .425884 |
     |------------------------------------------------------------------|
101. |   1.32567          .    .145112   -.073345   -.035808   -.292849 |
102. |   1.48026          .     .04562   -.057562   -.080453    .094772 |
103. |   .290545          .    .027436   -.039534   -.021071   -.034177 |
104. |   1.62895          .   -.031766   -.060089    .013959    -.23388 |
105. |   1.21151          .   -.006183     .02949   -.006173    .528441 |
     +------------------------------------------------------------------+

Since all listed variables are dummies representing categorical variables, they should contain only the values 0 and 1 (as the _5_ind variables do in observations 96-99, where industry is not missing). Furthermore, dummies representing one categorical variable should add up to 1.

. mi mvncat race1 race2 race3 \ ind1-ind12

. list _1_race1 _1_race2 _1_race3 _5_ind1 _5_ind2 _5_ind12 in 96/105

     +------------------------------------------------------------------+
     |  _1_race1   _1_race2   _1_race3    _5_ind1    _5_ind2   _5_ind12 |
     |------------------------------------------------------------------|
 96. |         1          0          0          0          0          0 |
 97. |         0          1          0          0          0          0 |
 98. |         1          0          0          0          0          0 |
 99. |         1          0          0          0          0          0 |
100. |         0          1          0          0          0          0 |
     |------------------------------------------------------------------|
101. |         1          0          0          0          0          0 |
102. |         1          0          0          0          0          0 |
103. |         0          1          0          0          0          0 |
104. |         1          0          0          0          0          0 |
105. |         1          0          0          0          0          0 |
     +------------------------------------------------------------------+
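As a quick sanity check on the result, the loop below counts, for each imputation, the observations whose race dummies are all non-missing but do not add up to 1. It assumes the example was run as shown (wide style, add(5), race2 included in dset). A count of 0 in every imputation is what one would hope to see.

```stata
forvalues m = 1/5 {
    * rows with complete race dummies that do not sum to 1
    count if !missing(_`m'_race1, _`m'_race2, _`m'_race3) ///
        & _`m'_race1 + _`m'_race2 + _`m'_race3 != 1
}
```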

Omitting race2 (the reference) from mi mvncat race1 race2 race3 will leave _1_race2 (and all _m_race2) unchanged (i.e. soft missing), while still correctly assigning values 0 or 1 to _m_race1 and _m_race3.

References

Allison, Paul D. (2002) Missing Data. Thousand Oaks, CA: Sage Publications.

Author

Daniel Klein, University of Bamberg, klein.daniel.81@gmail.com

Also see

Online: mi, egen

if installed: chm