Title

obsofint -- Displays observations of interest.

Syntax

obsofint [varlist] [if] [in] [weight] [, zcutoff(#) whiskerlength(#) z tukey asymtukey idlist(varlist) sortid nosort loud summarize[(stats [, verbose])] generate(stub [, replace all]) list_options]

fweights, aweights, and pweights are allowed.

Description

obsofint is intended to help scan a large number of variables for unusual observations. These observations are unusual in the sense that they have either a much smaller or a much larger value on a given variable than the bulk of the data. Such observations are commonly referred to as outliers, but we feel that this term is increasingly associated with the bad practice of automatically treating such observations as problematic cases that should be deleted. So instead we will use the term observations of interest. On the one hand, this terminology allows for the possibility that these observations are data entry errors, while on the other hand it also allows for the possibility that these observations are legitimate observations. In the former case deletion of that observation is only a measure of last resort, while in the latter case deletion would remove especially informative observations.

The observations of interest are identified using a criterion that is an adaptation of the commonly used Tukey bounds. In our experience the traditional Tukey bounds flag too many values as extreme if a variable is either skewed or has a spike (a value that is very common). This is not surprising as these bounds were intended for normal/Gaussian-like variables, but it makes them less useful for scanning a large set of variables for unusual observations. So, instead, obsofint will use the following generalization of the Tukey bounds.

The traditional Tukey bounds are:

lb = Q_1 - 3*(Q_3 - Q_1), ub = Q_3 + 3*(Q_3 - Q_1)

The adjusted Tukey bounds are:

lb = Q_1 - 6*(Q_2 - Q_1), ub = Q_3 + 6*(Q_3 - Q_2)

lb and ub are the lower and upper bounds, respectively. Q_1, Q_2, and Q_3 are the first, second, and third quartiles.
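The two sets of bounds can be reproduced by hand for a single variable. The following is a minimal sketch, not the code obsofint itself runs: it assumes that the quartiles returned by summarize with the detail option match those used by obsofint, and the scalar names (q1, lb_adj, and so on) are made up for illustration.

```stata
* Sketch: compute both sets of bounds by hand for one variable
sysuse auto, clear
quietly summarize price, detail

* Quartiles as returned by summarize, detail
scalar q1 = r(p25)
scalar q2 = r(p50)
scalar q3 = r(p75)

* Traditional Tukey bounds (whisker length 3)
scalar lb_tukey = q1 - 3*(q3 - q1)
scalar ub_tukey = q3 + 3*(q3 - q1)

* Adjusted Tukey bounds (multiplier 6 = 2*3)
scalar lb_adj = q1 - 6*(q2 - q1)
scalar ub_adj = q3 + 6*(q3 - q2)

display "traditional: " lb_tukey " to " ub_tukey
display "adjusted:    " lb_adj " to " ub_adj

* Observations outside the adjusted bounds
list make price if (price < scalar(lb_adj) | price > scalar(ub_adj)) & !missing(price)
```

For a symmetric variable Q_3 - Q_2 equals Q_2 - Q_1, so both distances are half the inter-quartile range and the two sets of bounds coincide.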

These adjusted Tukey bounds tend to lead to fewer false positives (that is, fewer observations that are flagged as of interest but are actually perfectly normal) when the data are skewed. However, these bounds still lead to too many false positives when the variable contains a spike that is large enough to make the first and second quartiles, the second and third quartiles, or all three quartiles the same. If this happens, obsofint will automatically change the criterion for an observation of interest to a deviation of more than 3 standard deviations from the mean.
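The fallback criterion can be sketched in the same way. This is an illustration under assumptions rather than the internal implementation: the check for coinciding quartiles relies on the percentiles from summarize with the detail option, and the scalar names lb_z and ub_z are hypothetical.

```stata
* Sketch: z-score criterion with the default cutoff of 3
sysuse auto, clear
quietly summarize price, detail

* A value of 1 means the two quartiles coincide (a spike)
display "Q1 = Q2: " (r(p25) == r(p50)) "   Q2 = Q3: " (r(p50) == r(p75))

* Bounds at 3 standard deviations from the mean
scalar lb_z = r(mean) - 3*r(sd)
scalar ub_z = r(mean) + 3*r(sd)

list make price if (price < scalar(lb_z) | price > scalar(ub_z)) & !missing(price)
```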

Options

whiskerlength(#) specifies the number of inter-quartile ranges an observation needs to deviate from the 1st or 3rd quartile in order to be classified as an observation of interest, when the traditional Tukey bounds are used. The default is 3. For the adjusted Tukey bounds, which are the default in obsofint, the multiplier is twice this number: it is applied to the distance between the first quartile and the median for the lower bound, and to the distance between the median and the third quartile for the upper bound. Notice that this way the traditional Tukey bounds and the adjusted Tukey bounds lead to exactly the same results when the distribution of that variable is symmetric.

zcutoff(#) specifies the number of standard deviations that an observation needs to deviate from the mean in order to be classified as an observation of interest. The default is 3.

z specifies that only the z-score criterion (that is, the number of standard deviations that an observation deviates from the mean) is to be used when identifying observations of interest. When this option is specified, the whiskerlength() option cannot be specified.

tukey specifies that only the traditional Tukey bounds are to be used when identifying observations of interest. When this option is specified, the zcutoff() option cannot be specified.

asymtukey specifies that only the adjusted Tukey bounds are to be used when identifying observations of interest. When this option is specified, the zcutoff() option cannot be specified.
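As an illustration of how the criterion options combine, the calls below use only options documented above. The cutoff values are arbitrary choices for the sketch, not recommendations.

```stata
sysuse auto, clear

* Default criterion, but with narrower bounds (whisker length 2 instead of 3)
obsofint price mpg weight, idlist(make) whiskerlength(2)

* z-score criterion only, with a cutoff of 2 standard deviations
obsofint price mpg weight, idlist(make) z zcutoff(2)

* Traditional Tukey bounds only, with the default whisker length
obsofint price mpg weight, idlist(make) tukey
```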

idlist(varlist) specifies the variables that are to be listed next to the extreme values. Typically these will be identification numbers and/or variables that might explain why an observation might be exceptional.

sortid specifies that observations are sorted by the variables specified in idlist(). The idlist() option must therefore be specified when specifying the sortid option. The default is that the observations are sorted by the variable being listed.

nosort specifies that the observations are not sorted. The default is that the observations are sorted by the variable being listed. If the nosort option is specified, the observation numbers will be displayed as is normal in list. When the nosort option is not specified, the observation number will be displayed as a variable with the name obs_nr, unless there is already a variable with that name. In that case the name obs will be used, unless there is also a variable with that name, in which case the name _n will be used.

loud specifies that obsofint will display a message for each variable that does not have any observations flagged as of interest. By default, obsofint displays nothing for those variables.

summarize[(stats [, verbose])] specifies that for those variables that contain observations of interest, the statistics stats will be displayed in the report. If no statistics have been specified, it will display N, mean, sd, min, p25, p50, p75, and max. One or more of the following statistics may be specified: N, sum_w, mean, Var, sd, skewness, kurtosis, sum, min, max, p1, p5, p10, p25, p50, p75, p90, p95, p99. The verbose suboption

generate(stub [, replace all]) specifies that indicator variables are to be created for each variable containing observations of interest, which will be 1 if that observation is of interest and 0 otherwise. These indicator variables will be called stub_variablename. If these variables already exist, obsofint will exit with an error unless the replace suboption is specified, in which case the existing indicator variables will be overwritten. If the all suboption is specified, these indicator variables will be created for all variables checked by obsofint, regardless of whether observations of interest were found or not.
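A short usage sketch of generate() follows. The stub name ooi is an arbitrary choice, and the sketch assumes that the indicator names consist of the stub, an underscore, and the variable name (for example ooi_price), and that the two suboptions are separated by a space.

```stata
sysuse auto, clear

* Create indicators for every checked variable (all suboption)
obsofint price mpg, idlist(make) generate(ooi, all)

* Inspect the new indicator variables
describe ooi_*
tabulate ooi_price

* Rerunning requires replace; otherwise the existing indicators cause an error
obsofint price mpg, idlist(make) generate(ooi, replace all)
```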

list_options are options for list; all of them can also be specified for obsofint and will be used when listing observations of interest.

Saved results

r(result) contains a matrix with a row for each variable in varlist. The first column shows the number of observations classified as of interest; the second and third columns show the lower and upper bounds used to classify the observations. These bounds will be missing when the variable is a constant. The last three columns indicate whether the z-score, adjusted Tukey bounds, or traditional Tukey bounds criterion was used to identify observations of interest. The criterion used is represented by a 1 and the remaining criteria will receive a 0. If a variable is a constant, all criteria will receive a 0.
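The saved matrix can be copied and inspected with standard matrix commands. In this sketch the matrix name R is arbitrary, and the column positions follow the description above.

```stata
sysuse auto, clear
obsofint price mpg weight, idlist(make)

* Copy the saved results before another command overwrites r()
matrix R = r(result)
matrix list R

* Row 1 is the first variable in varlist; column 1 is the count,
* columns 2 and 3 are the lower and upper bounds
display "price: " R[1,1] " flagged, bounds " R[1,2] " to " R[1,3]
```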

Example

sysuse auto, clear
obsofint, idlist(make)

sysuse auto, clear
obsofint price - foreign, loud idlist(make) sum

sysuse auto, clear
obsofint price - foreign, idlist(make) tukey

Authors

Maarten L. Buis Universitaet Tuebingen Institut fuer Soziologie maarten.buis@uni-tuebingen.de

Ronnie Babigumira Center for International Forestry Network (CIFOR) The Poverty Environment Network (PEN) r.babigumira@cgiar.org

Acknowledgement

obsofint was written while Maarten Buis was visiting CIFOR as a consultant to work on the PEN project.

