------------------------------------------------------------------------------- help for hsmode -------------------------------------------------------------------------------

Half-sample modes

hsmode varlist [if exp] [in range] [, allobs format(format) name(#) spaces(#) ]

hsmode varname [if exp] [in range] [, allobs by(byvar) missing format(format) name(#) spaces(#) generate(newvar) ]

by ... : may also be used with hsmode: see help on by.

Description

hsmode calculates half-sample modes for varlist based on recursive selection of the half-sample with the shortest length. Although it has longer roots, this implementation of half-sample modes is based particularly on the ideas of Bickel and Frühwirth (2006).

Remarks

The idea of estimating the mode as the midpoint of the shortest interval that contains a fixed number of observations goes back at least to Dalenius (1965). See also Robertson and Cryer (1974), Bickel (2002) and Bickel and Frühwirth (2006) on other estimators of the mode.

The order statistics of a sample of n values of x are defined by

x(1) <= x(2) <= ... <= x(n-1) <= x(n).

The half-sample mode is here defined using two rules.

Rule 1. If n = 1, the half-sample mode is x(1). If n = 2, the half-sample mode is (x(1) + x(2)) / 2. If n = 3, the half-sample mode is (x(1) + x(2)) / 2 if x(1) and x(2) are closer than x(2) and x(3), (x(2) + x(3)) / 2 if the opposite is true, and x(2) otherwise.

Rule 2. If n >= 4, we apply recursive selection until left with 3 or fewer values. First let h_1 = floor(n / 2). The shortest half of the data from rank k to rank k + h_1 is identified to minimise

x(k + h_1) - x(k)

over k = 1, ..., n - h_1. Then the shortest half of those h_1 + 1 values is identified using h_2 = floor(h_1 / 2), and so on. To finish, use Rule 1.
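Rules 1 and 2 can be sketched in a few lines. The following is an illustrative Python sketch based only on the rules as stated in this help, not the code of hsmode itself; the function name half_sample_mode is hypothetical. Ties between equally short halves are resolved by the ceiling(t / 2) rule described under Ties below, and ties are detected by exact comparison, which is an assumption that may matter for non-integer data.

```python
from math import ceil, floor


def half_sample_mode(values):
    """Half-sample mode following Rules 1 and 2 (illustrative sketch only)."""
    x = sorted(values)                 # order statistics x(1) <= ... <= x(n)
    if not x:
        raise ValueError("need at least one value")

    h = floor(len(x) / 2)              # h_1 = floor(n / 2)
    while len(x) >= 4:                 # Rule 2: repeat until 3 or fewer values remain
        # Length x(k + h) - x(k) of each candidate half (k is 0-based here).
        lengths = [x[k + h] - x[k] for k in range(len(x) - h)]
        shortest = min(lengths)
        # Assumption: exact equality identifies ties.
        tied = [k for k, length in enumerate(lengths) if length == shortest]
        # Tie-break: the tied half at position ceiling(t / 2), counting upwards.
        k = tied[ceil(len(tied) / 2) - 1]
        x = x[k:k + h + 1]             # keep the h + 1 values of the chosen half
        h = floor(h / 2)               # h_2 = floor(h_1 / 2), and so on

    # Rule 1: finish with 1, 2 or 3 values.
    if len(x) == 1:
        return x[0]
    if len(x) == 2:
        return (x[0] + x[1]) / 2
    lower_gap, upper_gap = x[1] - x[0], x[2] - x[1]
    if lower_gap < upper_gap:
        return (x[0] + x[1]) / 2
    if upper_gap < lower_gap:
        return (x[1] + x[2]) / 2
    return x[1]


print(half_sample_mode([1, 2, 2.5, 3, 10]))   # 2.5
```

In the usage line n = 5, so h_1 = 2 and the shortest half of three values is 2, 2.5, 3; its two gaps are equal, so Rule 1 returns the middle value.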

The idea of identifying the shortest half is applied in the "shorth" named by J.W. Tukey and introduced in the Princeton robustness study of estimators of location by Andrews, Bickel, Hampel, Huber, Rogers and Tukey (1972, p.26) as the mean of the values x(k), ..., x(k + h) in the shortest half, for h = floor(n / 2). Rousseeuw (1984), building on a suggestion by Hampel (1975), pointed out that the midpoint of the shortest half, (x(k) + x(k + h)) / 2, is the least median of squares (LMS) estimator of location for x. See Rousseeuw (1984) and Rousseeuw and Leroy (1987) for applications of LMS and related ideas to regression and other problems. Note that this LMS midpoint is also called the shorth in some recent literature (e.g. Maronna, Martin and Yohai 2006, p.48). Further, the shortest half itself is also sometimes called the shorth, as the title of Grübel (1988) indicates. For a Stata implementation and more detail, see shorth from SSC.
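To make the distinction between these related summaries concrete, here is a minimal Python sketch that finds a single shortest half and reports both its mean (the shorth in the original sense) and its midpoint (the LMS estimator of location). The function names are hypothetical, and when halves tie this sketch simply takes the first, which is an assumption and not necessarily what the shorth program does.

```python
def shortest_half(values):
    """Return the values x(k), ..., x(k + h) of a shortest half, h = floor(n / 2)."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one value")
    h = n // 2
    # min() returns the first minimiser, so ties go to the lowest k (an assumption).
    k = min(range(n - h), key=lambda i: x[i + h] - x[i])
    return x[k:k + h + 1]


def shorth_mean(values):
    """Mean of the values in the shortest half."""
    half = shortest_half(values)
    return sum(half) / len(half)


def lms_midpoint(values):
    """Midpoint of the shortest half: (x(k) + x(k + h)) / 2."""
    half = shortest_half(values)
    return (half[0] + half[-1]) / 2


data = [0, 4, 5, 6, 7.5, 20, 30]
print(shortest_half(data))   # [4, 5, 6, 7.5]
print(shorth_mean(data))     # 5.625
print(lms_midpoint(data))    # 5.75
```

The mean and the midpoint differ whenever the values within the shortest half are not symmetric about its centre, as here.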

Some broad-brush comments follow on advantages and disadvantages of half-sample modes, from the standpoint of practical data analysts as much as mathematical or theoretical statisticians. Whatever the project, it will always be wise to compare hsmode results with standard summary measures (e.g. medians or means, including geometric and harmonic means) and to relate results to graphs of distributions. Moreover, if your interest is in the existence or extent of bimodality or multimodality, it will be best to look directly at suitably smoothed estimates of the density function.

1. Mode estimation. By summarizing where the data are densest, the half-sample mode adds an automated estimator of the mode to the toolbox. More traditional estimates of the mode, based on identifying peaks on histograms or even kernel density plots, are sensitive to decisions about bin origin or width, or about kernel type and kernel half-width, and are more difficult to automate in any case. When applied to distributions that are unimodal and approximately symmetric, the half-sample mode will be close to the mean and median, but more resistant than the mean to outliers in either tail. When applied to distributions that are unimodal and asymmetric, the half-sample mode will typically be much nearer the mode identified by other methods than either the mean or the median.

2. Simplicity. The idea of the half-sample mode is fairly simple and easy to explain to students and researchers who do not regard themselves as statistical specialists.

3. Graphic interpretation. The half-sample mode can easily be related to standard displays of distributions such as kernel density plots, cumulative distribution and quantile plots, histograms and stem-and-leaf plots.

At the same time, note the following points.

4. Not useful for all distributions. When applied to distributions that are approximately J-shaped, the half-sample mode will approximate the minimum of the data. When applied to distributions that are approximately U-shaped, the half-sample mode will be within whichever half of the distribution happens to have higher average density. Neither behaviour seems especially interesting or useful, but equally there is little call for single mode-like summaries for J-shaped or U-shaped distributions. For U shapes, bimodality makes the idea of a single mode moot, if not invalid.

5. Ties. The shortest half may not be uniquely defined. Even with measured data, rounding of reported values may frequently give rise to ties. What to do with two or more shortest halves has been little discussed in the literature. Note that tied halves may either overlap or be disjoint.

The procedure adopted in hsmode given t ties is to use the middlemost in order, except that that is in turn not uniquely defined unless t is odd. The middlemost is arbitrarily taken to have position ceiling(t / 2) in order, counting upwards. This is thus the 1st of 2, the 2nd of 3 or 4, and so forth.

This tie-break rule has some quirky consequences. Thus with values -9 -4 -1 0 1 4 9, the rules yield -0.5 as the half-sample mode, not 0 as would be natural on all other grounds. Otherwise put, this problem can arise because for a window to be placed symmetrically the window length 1 + floor(n / 2) must be odd for odd n and even for even n, which is difficult to achieve given other desiderata, notably that window length should never decrease with sample size. We prefer to believe that this is a minor problem with datasets of reasonable size.
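The first step of that calculation, for the symmetric values -9 -4 -1 0 1 4 9, shows where the asymmetry enters. The Python sketch below is illustrative only; the helper name middlemost is hypothetical and ties are found by exact comparison.

```python
from math import ceil


def middlemost(tied_positions):
    """Pick the tied half at position ceiling(t / 2), counting upwards."""
    t = len(tied_positions)
    return tied_positions[ceil(t / 2) - 1]


x = [-9, -4, -1, 0, 1, 4, 9]
h = len(x) // 2                        # 3, so each half holds 4 values
# Lengths x(k + h) - x(k) for k = 1, ..., n - h (0-based index in the code).
lengths = [x[k + h] - x[k] for k in range(len(x) - h)]
shortest = min(lengths)
# Report tied halves by their 1-based starting rank k.
tied = [k + 1 for k, length in enumerate(lengths) if length == shortest]
chosen = middlemost(tied)

print(lengths)   # [9, 5, 5, 9]
print(tied)      # [2, 3]
print(chosen)    # 2
```

The halves starting at ranks 2 and 3 tie, and the first of the two is taken, giving -4 -1 0 1. Halving again with h = 1, the pairs (-1, 0) and (0, 1) tie in turn, the first is again taken, and its midpoint is -0.5.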

6. Rationale for window length. Why half is taken to mean 1 + floor(n / 2) also does not appear to be discussed. Evidently we need a rule that yields a window length for both odd and even n; it is preferable that the rule be simple; and there is usually some slight arbitrariness in choosing a rule of this kind. It is also important that any rule behave reasonably for small n: even if a program is not deliberately invoked for very small sample sizes the procedure used should make sense for all possible sizes. Note that, given n = 1, the half-sample mode is just the single sample value, and, given n = 2, it is the average of the two sample values. A further detail about this rule is that it always defines a slight majority, thus enforcing democratic decisions about the data. However, there seems no strong reason not to use ceiling(n / 2) as an even simpler rule, except that if it makes much difference, then it is likely that your sample size or variable is unsuitable for the purpose.
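The two candidate rules are easily tabulated side by side. This short Python sketch (function names are illustrative) prints the window length under each rule for small n.

```python
from math import ceil, floor


def majority_window(n):
    """Window length 1 + floor(n / 2): always more than half of n."""
    return 1 + floor(n / 2)


def ceiling_window(n):
    """Alternative window length ceiling(n / 2): at least half of n."""
    return ceil(n / 2)


print(" n  1+floor(n/2)  ceiling(n/2)")
for n in range(1, 11):
    print(f"{n:2d}  {majority_window(n):12d}  {ceiling_window(n):12d}")
```

The two rules agree whenever n is odd and differ by one whenever n is even, where ceiling(n / 2) gives exactly half of the values and not a majority.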

7. Use with weighted data. Identification of the half-sample mode for values associated with unequal weights is not supported at this time.

Options

allobs specifies use of the maximum possible number of observations for each variable. The default is to use only those observations for which all variables in varlist are not missing.

by() specifies a variable defining distinct groups for which statistics should be calculated. by() is allowed only with a single varname. The choice between by: and by() is partly one of precisely what kind of output display is required. The display with by: is clearly structured by groups while that with by() is more compact. To show statistics for several variables and several groups with a single call to hsmode, the display with by: is essential.

missing specifies that with the by() option observations with missing values of byvar should be included in calculations. The default is to exclude them.

format(format) specifies a numeric format for displaying results. The default is %8.2g.

name(#) specifies a maximum length for showing variable names (or, in the case of by(), values or value labels of byvar) in the display of results. The default is 32.

spaces(#) specifies the number of spaces to be shown between columns of results. The default is 2.

generate() specifies a new variable to hold calculated modes. generate() is allowed only with a single varname. This option is most useful when you want to save modes calculated for several groups for further analysis. Note that generate() is not allowed with the by: prefix: use the by() option instead. Values for the new variable will necessarily be identical for all observations in each group: typically it will be useful to select just one observation for each group, say by using egen, tag().

Examples

Robertson and Cryer (1974, p.1014) reported 35 measurements of uric acid (in mg/100 ml): 1.6, 3.11, 3.95, 4.2, 4.2, 4.62, 4.62, 4.62, 4.7, 4.87, 5.04, 5.29, 5.3, 5.38, 5.38, 5.38, 5.54, 5.54, 5.63, 5.71, 6.13, 6.38, 6.38, 6.67, 6.69, 6.97, 7.22, 7.72, 7.98, 7.98, 8.74, 8.99, 9.27, 9.74, 10.66. hsmode reports a mode of 5.38. Robertson and Cryer's own estimates using a rather different procedure are 5.00, 5.02, 5.04. The density estimate produced by kdensity with its default settings supports the hsmode result here.

. hsmode price-foreign

. bysort rep78: hsmode mpg

. hsmode mpg, by(rep78) generate(hsmode)

Saved results

(for last-named variable or group only)

r(N)         n
r(hsmode)    half-sample mode

Author

Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk

References

Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers and J.W. Tukey. 1972. Robust estimates of location: survey and advances. Princeton, NJ: Princeton University Press.

Bickel, D.R. 2002. Robust estimators of the mode and skewness of continuous data. Computational Statistics & Data Analysis 39: 153-163.

Bickel, D.R. and R. Frühwirth. 2006. On a fast, robust estimator of the mode: comparisons to other estimators with applications. Computational Statistics & Data Analysis 50: 3500-3530.

Dalenius, T. 1965. The mode - A neglected statistical parameter. Journal, Royal Statistical Society A 128: 110-117.

Grübel, R. 1988. The length of the shorth. Annals of Statistics 16: 619-628.

Hampel, F.R. 1975. Beyond location parameters: robust concepts and methods. Bulletin, International Statistical Institute 46: 375-382.

Maronna, R.A., R.D. Martin and V.J. Yohai. 2006. Robust statistics: theory and methods. Chichester: John Wiley.

Robertson, T. and J.D. Cryer. 1974. An iterative procedure for estimating the mode. Journal, American Statistical Association 69: 1012-1016.

Rousseeuw, P.J. 1984. Least median of squares regression. Journal, American Statistical Association 79: 871-880.

Rousseeuw, P.J. and A.M. Leroy. 1987. Robust regression and outlier detection. New York: John Wiley.

Also see

Online: egen, kdensity, means, modes (if installed), shorth (if installed)