-------------------------------------------------------------------------------
help for fmiss: module to identify variables with problematic missing values 
                                                                    Version 1.0
-------------------------------------------------------------------------------

Syntax

fmiss [varlist] [if] [in] [, detail percentage level(real)]

Description

fmiss reports not only the total number of missing values in each variable, but also how many of them are unique, in the sense that the information is available for all other variables in varlist for the same observation. This distinction shows which variable causes a large drop in the sample size on its own. The module identifies missing values in both numerical and string variables. For numerical variables, Stata's extended missing values (e.g. ".a") are identified as well. Because missing values may introduce a sample selection problem, fmiss offers a very simple and purely exploratory way to detect such problems. With the option detail, a mean-comparison test is computed between the original sample and the sample one would obtain by including the variable (that is, after dropping the observations with unique missing values), and the variables for which the difference is significant are reported.

Options

detail The option detail adds a simple analysis of how the sample would change when observations with unique missing values are dropped. A t-test compares the full sample to the sample that remains after dropping the observations with unique missing values in the current variable. All variables whose mean differs significantly between these two samples are reported. If no variable is reported, this does not mean that there is no sample selection bias. This module only gives you a first impression of the data and of potentially problematic variables.

level(real) This option changes the significance level of the t-test performed on the sample with and without the observations with unique missing values. The default value is 10%, since severe sample selection problems are likely to be present already at this level. If you prefer the standard 5% threshold, add the option level(0.05).

percentage This option changes the output from frequencies to percentages of the total sample.
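
The options above can be combined as in the following usage sketch. It assumes that fmiss is installed and on the ado-path, and it uses only the syntax documented in this help file; the choice of variables is purely illustrative.

```stata
* Usage sketch (assumes fmiss is installed); variable choice is illustrative
sysuse lifeexp, clear

* No varlist: all variables of the dataset are analysed, frequencies only
fmiss

* Selected variables, reported as percentages of the total sample
fmiss safewater gnppc popgrowth, percentage

* t-tests at the 5% level instead of the default 10%
fmiss safewater gnppc popgrowth, detail level(0.05)
```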

Detailed explanation of the output

The output of the module is mostly self-explanatory; however, some of the terms used might be unclear:

Missings

Refers to the total number of missing values in the variable. This value corresponds to what the command misstable reports, for instance.
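
As a cross-check, the total number of missings can be obtained with built-in Stata commands. The sketch below uses the lifeexp dataset from the example in this help file; it is only meant to show where the figure in the Missings column comes from, not how fmiss computes it internally.

```stata
* Cross-check of the "Missings" column with built-in commands
sysuse lifeexp, clear

* Built-in overview of missing values (numeric variables only here)
misstable summarize safewater gnppc

* Direct count of missing values in a single variable
count if missing(safewater)
```

The example output further down in this help file reports 28 missings for safewater, which is the figure the count should reproduce.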

Unique missings

This is the number of observations in which the current variable is missing while all other variables of varlist are available. Regardless of the order of deletion, these observations are always lost when you include the variable.
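
The definition of unique missings can be reproduced by hand. The following is a minimal sketch under stated assumptions: the names nmiss and uniq_safewater are hypothetical helper variables, and the loop is one possible way to count missings per observation, not necessarily the way fmiss does it.

```stata
* Sketch: count unique missings of safewater by hand
* (nmiss and uniq_safewater are hypothetical helper variables)
sysuse lifeexp, clear
local vars region country safewater popgrowth lexp gnppc

* Number of variables in the list that are missing in each observation;
* missing() handles numeric and string variables alike
generate nmiss = 0
foreach v of local vars {
    quietly replace nmiss = nmiss + missing(`v')
}

* A unique missing: safewater is the only missing variable of the observation
generate byte uniq_safewater = missing(safewater) & nmiss == 1

* Total missings, then unique missings, of safewater
count if missing(safewater)
count if uniq_safewater
```

For comparison, the example output in this help file reports 28 missings and 26 unique missings for safewater.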

Significant change in:

Refers to the variables whose mean changes significantly when the observations with unique missings (in the current variable) are excluded. This means that including the current variable changes the sample mean of the listed variables, which might cause a sample selection problem. Note that the test does not compare the excluded observations to the remaining observations, but the full sample to the remaining sample.
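
To make the comparison of the full sample with the remaining sample concrete, the sketch below tests whether the mean of popgrowth in the remaining sample differs from its full-sample mean. This is an assumption made for illustration: it uses a one-sample t-test against the full-sample mean, which is one possible way to set up such a comparison and not necessarily the test fmiss performs. The names nmiss, uniq and fullmean are hypothetical.

```stata
* Sketch: full sample vs. remaining sample for popgrowth
* (one-sample t-test is an ASSUMPTION, not necessarily what fmiss does)
sysuse lifeexp, clear
local vars region country safewater popgrowth lexp gnppc

* Flag observations where safewater is the only missing variable
generate nmiss = 0
foreach v of local vars {
    quietly replace nmiss = nmiss + missing(`v')
}
generate byte uniq = missing(safewater) & nmiss == 1

* Mean of popgrowth in the full sample
quietly summarize popgrowth
local fullmean = r(mean)

* Does the mean in the remaining sample differ from the full-sample mean?
ttest popgrowth == `fullmean' if !uniq

* Two-sided p-value, to be compared with the chosen level (default 0.10)
display "p-value: " r(p)
```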

Example

. sysuse lifeexp, clear
(Life expectancy, 1998)

. fmiss region country safewater popgrowth lexp gnppc, detail

Analysis of missing variables in the dataset
Total sample size: 68
Sample without any missing: 37 (54.40%)
---------------------------------------------------------------------------
                             Unique
Variable        Missings   missings   Significant change in
---------------------------------------------------------------------------
region                 0          0   ---
country                0          0   ---
safewater             28         26   region popgrowth
popgrowth              0          0   ---
lexp                   0          0   ---
gnppc                  5          3   --- (Smallest pvalue: 0.880)
---------------------------------------------------------------------------
See help file for details on the exact definition of columns

In this example, fmiss performs the analysis on all variables indicated in varlist. Without a varlist, all variables of the dataset would be included. The option detail activates the t-tests comparing the full sample to the sample from which the observations with unique missings are excluded. This test is performed for all other variables, to see whether including the current variable would significantly change their mean in the sample. Here, the means of region and popgrowth would change significantly if we dropped the 26 observations with a unique missing value in safewater. Beware that the t-test is performed on all numerical variables, even if they hold categorical codes! The t-test does not prove that you have, or do not have, a sample selection problem, but it can serve as a starting point to identify such problems.

Known issues

- With the option detail, the t-test is performed on all numerical variables, regardless of their structure. For categorical variables this statistic is rather meaningless.

- The t-test gives only a very rough idea of possible changes in other variables due to the exclusion of observations with missing values. For instance, any mean-preserving change to the distribution of the other variables will not be detected.

- If you find another issue, please send me an email indicating the problem.

Author

Florian Wendelspiess Chávez Juárez, University of Geneva, Department of Economics: florian@chavezjuarez.com. This is version 1.0; I plan to develop this module according to future needs. If you have suggestions,