Descriptive statistics based on shortest halves
shorth varlist [if exp] [in range] [, proportion(#) allobs format(format) name(#) spaces(#) ties ]
shorth varname [if exp] [in range] [, proportion(#) allobs by(byvar) missing format(format) name(#) spaces(#) ties generate(specification) ]
by ... : may also be used with shorth: see help on by.
Description
shorth calculates descriptive statistics for varlist based on the shortest half of the distribution of each variable or group specified: the shorth, the mean of values in that shortest half; the midpoint of that half, which is the least median of squares estimate of location; and the length of the shortest half.
Remarks
The order statistics of a sample of n values of x are defined by
x(1) <= x(2) <= ... <= x(n-1) <= x(n).
Let h = floor(n / 2). Then the shortest half of the data, running from rank k to rank k + h, is identified by choosing k to minimise
x(k + h) - x(k)
over k = 1, ..., n - h. The minimised difference x(k + h) - x(k) is the length of the shortest half. The "shorth" was named by J.W. Tukey and introduced in the Princeton robustness study of estimators of location by Andrews, Bickel, Hampel, Huber, Rogers and Tukey (1972, p.26) as the mean of x(k), ..., x(k + h). It attracted attention for its unusual asymptotic properties (pp.50-52): on those, see also the later accounts of Shorack and Wellner (1986, pp.767-771) and Kim and Pollard (1990). Otherwise it quickly dropped out of sight for about a decade. Incidentally, Hampel (1997) shows that results available to the Princeton study on asymmetric situations, but not fully analysed at the time, put the shorth in a better light than was then appreciated.
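For concreteness, here is a minimal do-file sketch of this calculation for a single variable in the auto data. It illustrates the definitions only and is not the code used by shorth; missing values and tied shortest halves are ignored for simplicity.

        sysuse auto, clear
        sort mpg
        local n = _N
        local h = floor(`n' / 2)
        local kmax = `n' - `h'
        local best = .                       // missing sorts above all numbers
        forvalues k = 1/`kmax' {
            local len = mpg[`k' + `h'] - mpg[`k']
            if `len' < `best' {
                local best = `len'
                local kbest = `k'
            }
        }
        local kend = `kbest' + `h'
        summarize mpg in `kbest'/`kend', meanonly
        display "shorth   = " r(mean)
        display "midpoint = " (mpg[`kbest'] + mpg[`kend']) / 2
        display "length   = " `best'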
Interest revived in such ideas when Rousseeuw (1984), building on a suggestion by Hampel (1975), pointed out that the midpoint of the shortest half (x(k) + x(k + h)) / 2 is the least median of squares (LMS) estimator of location for x. See Rousseeuw (1984) and Rousseeuw and Leroy (1987) for applications of LMS and related ideas to regression and other problems. Note that this LMS midpoint is also called the shorth in some recent literature (e.g. Maronna, Martin and Yohai 2006, p.48). Further, the shortest half itself is also sometimes called the shorth, as the title of Grübel (1988) indicates.
The length of the shortest half is a robust measure of scale or spread: see Rousseeuw and Leroy (1988), Grübel (1988), Rousseeuw and Croux (1993) and Martin and Zamar (1993) for further analysis and discussion.
The length of the shortest half for a Gaussian (normal) distribution with mean 0 and standard deviation 1 is, in Stata terms, 2 * invnorm(0.75), which is 1.349 to 3 decimal places. Thus to estimate the standard deviation from an observed length, divide by this Gaussian length.
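For example, after shorth has been run, the saved result r(length) may be rescaled in this way. The sketch below uses the auto data; whether the result is a sensible estimate of the standard deviation naturally depends on the distribution being reasonably close to Gaussian in shape.

        . display 2 * invnorm(0.75)
        . sysuse auto, clear
        . quietly shorth mpg
        . display "sd estimate from shortest half = " r(length) / (2 * invnorm(0.75))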
Some broad-brush comments follow on advantages and disadvantages of shortest half ideas, from the standpoint of practical data analysts as much as mathematical or theoretical statisticians. Whatever the project, it will always be wise to compare shorth results with standard summary measures (including other means, notably geometric and harmonic means) and to relate results to graphs of distributions. Moreover, if your interest is in the existence or extent of bimodality or multimodality, it will be best to look directly at suitably smoothed estimates of the density function.
1. Simplicity The idea of the shortest half is simple and easy to explain to students and researchers who do not regard themselves as statistical specialists. It leads directly to two measures of location and one of spread that are fairly intuitive. It is also relatively amenable to hand calculation with primitive tools (pencil and paper, calculators, spreadsheets).
2. Connections The similarities and differences between the length of the shortest half, the interquartile range and the median absolute deviation from the median (MAD) (or for that matter the probable error) are immediate. Thus, shortest half ideas are linked to other statistical ideas that should already be familiar to many data analysts.
3. Graphic interpretation The shortest half can easily be related to standard displays of distributions such as cumulative distribution and quantile plots, histograms and stem-and-leaf plots.
4. Mode By averaging where the data are densest, the shorth and also the LMS midpoint introduce a mode flavour to summary of location. When applied to distributions that are approximately symmetric, the shorth will be close to the mean and median, but more resistant than the mean to outliers in either tail and more efficient than the mean for distributions near Gaussian (normal) in shape. When applied to distributions that are unimodal and asymmetric, the shorth and the LMS will typically be nearer the mode than either the mean or the median. Note that the idea of estimating the mode as the midpoint of the shortest interval that contains a fixed number of observations goes back at least to Dalenius (1965). See also Robertson and Cryer (1974), Bickel (2002) and Bickel and Frühwirth (2006) on other estimators of the mode. The half-sample mode estimator of Bickel and Frühwirth is especially interesting as a recursive selection of the shortest half. For a Stata implementation and more detail, see hsmode from SSC.
5. Outlier identification A resistant standardisation such as (value - shorth) / length may help in identifying outliers; a sketch follows these numbered points. For discussions of related ideas, see Carey et al. (1997) and the references included there.
6. Generalise to shortest fraction The idea can be generalised to proportions other than one-half.
At the same time, note that
7. Not useful for all distributions When applied to distributions that are approximately J-shaped, the shorth will approximate the mean of the lower half of the data and the LMS midpoint will be rather higher. When applied to distributions that are approximately U-shaped, the shorth and the LMS midpoint will be within whichever half of the distribution happens to have higher average density. Neither behaviour seems especially interesting or useful, but equally there is little call for single mode-like summaries for J-shaped or U-shaped distributions; for J shapes, the mode is, or should be, the minimum and for U shapes, bimodality makes the idea of a single mode moot, if not invalid.
8. Interpretation under asymmetry If applied knowingly to asymmetric distributions, the query may be raised: What do you think you are estimating? That is, the target for an estimator of location is not well defined whenever there is no longer an unequivocal centre to a distribution. This is a good question. Three possible answers: I am not estimating anything, but doing descriptive statistics. I am estimating the mode. What is being estimated should be defined in terms of the estimator (compare Huber 1972).
9. Ties The shortest half may not be uniquely defined. Even with measured data, rounding of reported values may frequently give rise to ties. What to do with two or more shortest halves has been little discussed in the literature. Note that tied halves may either overlap or be disjoint.
The procedure adopted in shorth given t ties is to report the existence of ties and then to use the middlemost in order, except that the middlemost is in turn not uniquely defined unless t is odd. It is arbitrarily taken to have position ceiling(t / 2) in order, counting upwards. This is thus the 1st of 2, the 2nd of 3 or 4, and so forth.
This tie-break rule has some quirky consequences. Thus with values -9 -4 -1 0 1 4 9, there is a tie for shortest half between -4 -1 0 1 and -1 0 1 4. The rule yields -1 as the shorth, not 0 as would be natural on all other grounds. Otherwise put, this problem can arise because a window placed symmetrically around the order statistics that define the median must be of odd length for odd n and of even length for even n, which is difficult to achieve with a rule such as 1 + floor(n / 2) given other desiderata, notably that window length should never decrease with sample size.
Apart from reporting that the shortest half is indeterminate, other possibilities would be to report the average of the union of tied halves or the average of the averages of the tied halves for the shorth, and similarly for the LMS midpoint. See for example Carey et al. (1997), who average the midpoints. One merit of the tie-break rule here is that the shorth and LMS reported are always based on a predictable number of values, by default 1 + floor(n / 2).
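To see how shorth reports such a tie, one possibility is to enter the seven values of the example above and specify the ties option. The fragment below is illustrative only; the exact display will depend on the options chosen.

        clear
        input x
        -9
        -4
        -1
         0
         1
         4
         9
        end
        shorth x, ties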
10. Rationale for window length Why half is taken to mean 1 + floor(n / 2) also does not appear to be discussed. Evidently we need a rule that yields a window length for both odd and even n; it is preferable that the rule be simple; and there is usually some slight arbitrariness in choosing a rule of this kind. It is also important that any rule behave reasonably for small n: even if a program is not deliberately invoked for very small sample sizes the procedure used should make sense for all possible sizes. Note that, with this rule, given n = 1 the shorth is just the single sample value, and given n = 2 the shorth is the average of the two sample values. A further detail about this rule is that it always defines a slight majority, thus enforcing democratic decisions about the data. However, there seems no strong reason not to use ceiling(n / 2) as an even simpler rule, except that all authors on the shorth appear to have followed 1 + floor(n / 2).
11. Use with weighted data Identification of the shortest half would seem to extend only rather messily to situations in which observations are associated with unequal weights and is thus not attempted here.
12. Length when most values identical When at least half of the values in a sample are equal to some constant, the length of the shortest half is 0. So, for example, if most values are 0 and some are larger, the length of the shortest half is not particularly useful as a measure of scale or spread.
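Returning to point 5, the sketch below shows one possible resistant standardisation using the saved results of shorth with the auto data. The variable and scalar names, and the cut-off of 2, are arbitrary choices made only for illustration.

        . sysuse auto, clear
        . quietly shorth mpg
        . scalar sh = r(shorth)
        . scalar len = r(length)
        . generate double z_mpg = (mpg - scalar(sh)) / scalar(len)
        . list make mpg z_mpg if abs(z_mpg) > 2 & !missing(z_mpg)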
Options
proportion(#) specifies a proportion other than 0.5 defining a shortest fraction. That is, the window length will be 1 + floor(proportion * n). This option is rarely specified.
allobs specifies use of the maximum possible number of observations for each variable. The default is to use only those observations for which all variables in varlist are not missing.
by() specifies a variable defining distinct groups for which statistics should be calculated. by() is allowed only with a single varname. The choice between by: and by() is partly one of precisely what kind of output display is required. The display with by: is clearly structured by groups while that with by() is more compact. To show statistics for several variables and several groups with a single call to shorth, the display with by: is essential.
missing specifies that with the by() option observations with missing values of byvar should be included in calculations. The default is to exclude them.
format(format) specifies a numeric format for displaying summary statistics. The default is %8.2g.
name(#) specifies a maximum length for showing variable names (or, with by(), values or value labels) in the display of results. The default is 32.
spaces(#) specifies the number of spaces to be shown between columns of results. The default is 2.
ties requests a report of which intervals tie for shortest half. The ranks of the starting points k will be shown.
generate() specifies one or more new variables to hold calculated results. generate() is allowed only with a single varname. This option is most useful when you want to save statistics calculated for several groups for further analysis. Note that generate() is not allowed with the by: prefix: use the by() option instead. Values for the new variables will necessarily be identical for all observations in each group: typically it will be useful to select just one observation for each group, say by using egen, tag(), as in the last example below.
The specification consists of one or more space-separated elements newvar=statistic, where newvar is a new variable name and statistic is one of shorth, min, LMS (or lms), max, or length.
Examples
. shorth price-foreign
. bysort rep78: shorth mpg
. shorth mpg, by(rep78) generate(s=shorth LMS=LMS)
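The following variation keeps just one listed observation per group, as suggested under generate(); the new variable names sh, len and tag are arbitrary and used only for illustration.
. shorth mpg, by(rep78) generate(sh=shorth len=length)
. egen tag = tag(rep78)
. list rep78 sh len if tag, noobs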
Saved results
(for last-named variable or group only)
r(N)          n
r(shorth)     shorth
r(min)        minimum in shortest half
r(rank_min)   rank of minimum
r(LMS)        LMS (midpoint of shortest half)
r(max)        maximum in shortest half
r(rank_max)   rank of maximum
r(length)     length of shortest half
Author
Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk
References
Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers and J.W. Tukey. 1972. Robust estimates of location: survey and advances. Princeton, NJ: Princeton University Press.
Bickel, D.R. 2002. Robust estimators of the mode and skewness of continuous data. Computational Statistics & Data Analysis 39: 153-163.
Bickel, D.R. and R. Frühwirth. 2006. On a fast, robust estimator of the mode: comparisons to other estimators with applications. Computational Statistics & Data Analysis 50: 3500-3530.
Carey, V.J., E.E. Walters, C.G. Wager and B.A. Rosner. 1997. Resistant and test-based outlier rejection: effects on Gaussian one- and two-sample inference. Technometrics 39: 320-330.
Christmann, A., U. Gather and G. Scholz. 1994. Some properties of the length of the shortest half. Statistica Neerlandica 48: 209-213.
Dalenius, T. 1965. The mode - A neglected statistical parameter. Journal, Royal Statistical Society A 128: 110-117.
Grübel, R. 1988. The length of the shorth. Annals of Statistics 16: 619-628.
Hampel, F.R. 1975. Beyond location parameters: robust concepts and methods. Bulletin, International Statistical Institute 46: 375-382.
Hampel, F.R. 1997. Some additional notes on the "Princeton robustness year". In Brillinger, D.R., L.T. Fernholz and S. Morgenthaler (eds) The practice of data analysis: essays in honor of John W. Tukey. Princeton, NJ: Princeton University Press, 133-153.
Huber, P.J. 1972. Robust statistics: a review. Annals of Mathematical Statistics 43: 1041-1067.
Kim, J. and D. Pollard. 1990. Cube root asymptotics. Annals of Statistics 18: 191-219.
Maronna, R.A., R.D. Martin and V.J. Yohai. 2006. Robust statistics: theory and methods. Chichester: John Wiley.
Martin, R.D. and R.H. Zamar. 1993. Bias robust estimation of scale. Annals of Statistics 21: 991-1017.
Robertson, T. and J.D. Cryer. 1974. An iterative procedure for estimating the mode. Journal, American Statistical Association 69: 1012-1016.
Rousseeuw, P.J. 1984. Least median of squares regression. Journal, American Statistical Association 79: 871-880.
Rousseeuw, P.J. and C. Croux. 1993. Alternatives to the median absolute deviation. Journal, American Statistical Association 88: 1273-1283.
Rousseeuw, P.J. and A.M. Leroy. 1987. Robust regression and outlier detection. New York: John Wiley.
Rousseeuw, P.J. and A.M. Leroy. 1988. A robust scale estimator based on the shortest half. Statistica Neerlandica 42: 103-116.
Shorack, G.R. and J.A. Wellner. 1986. Empirical processes with applications to statistics. New York: John Wiley.
Also see
Online: egen, kdensity, means, hsmode (if installed), modes (if installed)