help hotdeck -------------------------------------------------------------------------------

Title

Impute missing values using the hotdeck method

Syntax

hotdeck [varlist] [using] [if exp] [in exp] , [ by(varlist) store impute(#) noise keep(varlist) command(command) parms(varlist) seed(#) infiles(filename filename ...) ]

Description

Hotdeck tabulates the missing data patterns within the varlist. A row of data with missing values in any of the variables in the varlist is defined as a `missing line' of data; similarly, a `complete line' is one where all the variables in the varlist contain data. The hotdeck procedure replaces the varlist variables in the `missing lines' with the corresponding values in the `complete lines'. Hotdeck should be used several times within a multiple imputation sequence since missing data are imputed stochastically rather than deterministically.

The nmiss missing lines in each stratum of the data described by the `by' option are replaced by lines sampled from the nobs complete lines in the same stratum. The approximate Bayesian bootstrap method of Rubin and Schenker (1986) is used: first, a bootstrap sample of nobs lines is drawn with replacement from the complete lines; then the replacements for the nmiss missing lines are sampled at random (again with replacement) from this bootstrap sample.
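
The two sampling steps of the approximate Bayesian bootstrap can be illustrated with a short Mata sketch for a single stratum. This is only an illustration under stated assumptions, not the code hotdeck itself runs: the data values, the seed, and the names y, complete, boot, and donors are all made up for the example, and selectindex() requires a reasonably recent Stata release.

```stata
* Illustrative sketch only; not taken from the hotdeck source code.
* Toy data for one stratum: 5 complete lines and 2 missing lines (made-up values).
mata:
rseed(12345)                                   // arbitrary seed, for reproducible draws
y        = (10 \ 20 \ 30 \ 40 \ 50 \ . \ .)
complete = select(y, y :< .)                   // the nobs complete lines
nobs     = rows(complete)
nmiss    = rows(y) - nobs

// Step 1: bootstrap sample of nobs lines, drawn with replacement
boot   = complete[1 :+ floor(nobs :* runiform(nobs, 1)), 1]

// Step 2: draw nmiss donor lines, with replacement, from the bootstrap sample
donors = boot[1 :+ floor(nobs :* runiform(nmiss, 1)), 1]

// Replace the missing lines with the donor values and display the result
y[selectindex(y :>= .), 1] = donors
y
end
```

Because the donors come from the bootstrap sample rather than directly from the complete lines, repeated runs with different seeds give different imputed values, which is why the procedure is repeated across several imputed datasets.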

A major assumption with the hotdeck procedure is that the missing data are either missing completely at random (MCAR) or missing at random (MAR), with the probability that a line is missing varying only with respect to the categorical variables specified in the `by' option.

If a dataset contains many variables with missing values then it is possible that many of the rows of data will contain at least one missing value. The hotdeck procedure will not work very well in such circumstances. There are more elaborate methods that replace only the missing values, rather than the whole row, with imputed values. These multivariate multiple imputation methods are discussed by Schafer (1997).

A critical point is that all variables that are used in the analysis should be included in the variable list. This is particularly true for variables that have missing data! Variables that predict missingness should be included in the by() option so that missing data are imputed within strata.

Latest Version

The latest version is always kept on the SSC website. To install the latest version, run the following command:

ssc install hotdeck, replace

Options

using specifies the root of the imputed datasets' filenames. The default is "imp", so the datasets will be saved as imp1.dta, imp2.dta, ....

by(varlist) specifies categorical variables defining strata within which the imputation is to be carried out. Missing values will be replaced by complete values only within the strata. If a stratum contains no complete records then no data will be imputed in that stratum, which will lead to the wrong answers. Make sure there are a reasonable number of complete records per stratum.

store specifies whether the imputed datasets are saved to disk.

impute(#) specifies the number of imputed datasets to generate. The number needed varies according to the percentage missing and the type of data, but generally 5 is sufficient.

noise specifies whether the individual analyses, from the command() option, are displayed.

keep(varlist) specifies the variables saved in the imputed datasets in addition to the imputed variables and the by list. By default the imputed variables and the by list are always saved.

command(command) specifies the analysis performed on every imputed dataset.

parms(varlist) specifies the parameters of interest from the analysis. If the command is a regression command then the parameter list can include a subset of the variables specified in the regression command. The final output consists of the combined estimates of these parameters. For non-standard commands that are "regression" commands the parms() option looks at the estimation matrix e(b) and requires the column names to identify the coefficients of interest.

seed(#) specifies the random number generator seed. When using the seed option, the hotdeck command must be used correctly: ALL variables in the analysis command must be in the variable list. This ensures that the correlations between the variables are maintained after imputation.

infiles(filename filename ...) specifies a list of files that have missing values replaced by imputed values. This is convenient when the user has several imputed datasets and wants to analyse them and combine the results.

Examples

Impute values for y in sex/age groups.

hotdeck y, by(sex age)
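
Before running a stratified imputation like the one above, it can help to count the complete lines available in each stratum, since a stratum with no complete lines cannot be imputed. The following is a minimal sketch using built-in Stata commands; y, sex, and age are the variables from the example, while complete and stratum are hypothetical helper variables introduced only for this check.

```stata
* Count complete lines per by() stratum before imputing.
* complete and stratum are hypothetical helper variables.
generate byte complete = !missing(y)          // 1 if the line is complete for the varlist
egen stratum = group(sex age), label          // one code per sex/age combination
tabstat complete, by(stratum) statistics(sum n)
```

In the resulting table, the sum column gives the number of complete lines and the N column gives the total number of lines in each stratum. Strata where the sum is zero or very small may need to be merged with neighbouring strata before imputation.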

Additionally, store the imputed datasets above as imp1.dta and imp2.dta.

hotdeck y using imp, store by(sex age) impute(2)

Hotdeck can also use the stored imputed data files imp1.dta and imp2.dta and carry out the combined analysis. This analysis is displayed for the coefficient of x and the constant term _cons.

hotdeck y using imp, command(logit y x) parms(x _cons) infiles(imp1 imp2)

Do not save imputed datasets to disk but carry out a logistic regression on the imputed datasets and display the coefficients for x and the constant term _cons of the model.

hotdeck y x, by(sex age) command(logit y x) parms(x _cons) impute(5)
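
To make a run like the one above reproducible, the seed() option can be added. The sketch below uses only options documented in this help file; the seed value 12345 is arbitrary and chosen purely for illustration. Note that both variables in the analysis command (y and x) appear in the variable list, as the seed() option requires.

```stata
* Same analysis as above with a fixed (arbitrary) seed.
hotdeck y x, by(sex age) command(logit y x) parms(x _cons) impute(5) seed(12345)
```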

Example - Multiple Equation Model

Multiple equation models require more complicated parms() statements. The example below can be applied to all multiple equation models. The only complication is that the names of the coefficients are different.

First run the following command

xtreg kgh f1, mle

Then inspect the matrix of coefficients

mat list e(b)

e(b)[1,4]
           kgh:        kgh:    sigma_u:    sigma_e:
            f1       _cons       _cons       _cons
y1  -1.6751401   77.792948           0   16.730843

Then the following command will do an imputation and analysis for the single parameter.

hotdeck kgh, by(ethn) command(xtreg kgh f1, mle) parms(kgh:f1) impute(5)
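
As an alternative to reading the names off the matrix listing, the full column names of e(b) can be placed in a local macro and displayed. This is a minimal sketch using Stata's colfullnames macro function after fitting the model once on the original data; the macro name names is an arbitrary choice.

```stata
* List the coefficient names in the equation:name form that parms() expects.
xtreg kgh f1, mle
local names : colfullnames e(b)
display "`names'"
```

For the e(b) matrix shown above, each name pairs the equation heading with the column name, so the coefficient of f1 appears as kgh:f1.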

Example - mlogit

Use this web dataset for Stata release 9.

use http://www.stata-press.com/data/r9/sysdsn3.dta

The simple model without handling missing data

mlogit insure male

Stata automatically stores the estimated coefficients in the matrix e(b). Note that the column headings are the parameter names that hotdeck uses, so you cannot use the simple syntax parms(male) because it refers to two parameters.

mat list e(b)

The following syntax handles the missing data using hotdeck imputation.

hotdeck insure male, command(mlogit insure male) parms(Prepaid:male) impute(5)

NOTE: hotdeck will fail when using mlogit if the category labels contain spaces. This is due to limitations in Stata's matrix commands.

Author

Adrian Mander, MRC Human Nutrition Research, Cambridge, UK.

Email adrian.mander@mrc-hnr.cam.ac.uk

See Also

Related commands

HELP FILES    Installation status    SSC installation links    Description

whotdeck      (if installed)         (ssc install whotdeck)    Weighted version of Hotdeck