```-------------------------------------------------------------------------------
help for shorth
-------------------------------------------------------------------------------

Descriptive statistics based on shortest halves

shorth varlist [if exp] [in range] [, proportion(#) allobs
format(format) name(#) spaces(#) ties ]

shorth varname [if exp] [in range] [, proportion(#) allobs by(byvar)
missing format(format) name(#) spaces(#) ties
generate(specification) ]

by ... : may also be used with shorth: see help on by.

Description

shorth calculates descriptive statistics for varlist based on the
shortest half of the distribution of each variable or group specified:
the shorth, the mean of values in that shortest half; the midpoint of
that half, which is the least median of squares estimate of location; and
the length of the shortest half.

Remarks

The order statistics of a sample of n values of x are defined by

x(1) <= x(2) <= ... <= x(n-1) <= x(n).

Let h = floor(n / 2). Then the shortest half of the data from rank k to
rank k + h is identified to minimise

x(k + h) - x(k)

over k = 1, ..., n - h. This interval we call the length of the shortest
half. The "shorth" was named by J.W. Tukey and introduced in the
Princeton robustness study of estimators of location by Andrews, Bickel,
Hampel, Huber, Rogers and Tukey (1972, p.26) as the mean of x(k), ...,
x(k + h). It attracted attention for its unusual asymptotic properties
(pp.50-52): on those, see also the later accounts of Shorack and Wellner
(1986, pp.767-771) and Kim and Pollard (1990). Otherwise it quickly
shows that results available to the Princeton study on asymmetric
situations, but not fully analysed at the time, put the shorth in better
light than was then appreciated.

Interest revived in such ideas when Rousseeuw (1984), building on a
suggestion by Hampel (1975), pointed out that the midpoint of the
shortest half (x(k) + x(k + h)) / 2 is the least median of squares (LMS)
estimator of location for x. See Rousseeuw (1984) and Rousseeuw and Leroy
(1987) for applications of LMS and related ideas to regression and other
problems.  Note that this LMS midpoint is also called the shorth in some
recent literature (e.g.  Maronna, Martin and Yohai 2006, p.48). Further,
the shortest half itself is also sometimes called the shorth, as the
title of Grübel (1988) indicates.

The length of the shortest half is a robust measure of scale or spread:
see Rousseeuw and Leroy (1988), Grübel (1988), Rousseeuw and Croux (1993)
and Martin and Zamar (1993) for further analysis and discussion.

The length of the shortest half in a Gaussian (normal) with mean 0 and
standard deviation 1 is in Stata language 2 * invnorm(0.75), which is
1.349 to 3 d.p. Thus to estimate standard deviation from the observed
length, divide by this Gaussian length.

shortest half ideas, from the standpoint of practical data analysts as
much as mathematical or theoretical statisticians.  Whatever the project,
it will always be wise to compare shorth results with standard summary
measures (including other means, notably geometric and harmonic means)
and to relate results to graphs of distributions. Moreover, if your
interest is in the existence or extent of bimodality or multimodality, it
will be best to look directly at suitably smoothed estimates of the
density function.

1. Simplicity  The idea of the shortest half is simple and easy to
explain to students and researchers who do not regard themselves as
statistical specialists. It leads directly to two measures of location
and one of spread that are fairly intuitive. It is also relatively
amenable to hand calculation with primitive tools (pencil and paper,

2. Connections  The similarities and differences between the length of
the shortest half, the interquartile range and the median absolute
deviation from the median (MAD) (or for that matter the probable error)
are immediate. Thus, shortest half ideas are linked to other statistical
ideas that should already be familiar to many data analysts.

3. Graphic interpretation  The shortest half can easily be related to
standard displays of distributions such as cumulative distribution and
quantile plots, histograms and stem-and-leaf plots.

4. Mode  By averaging where the data are densest, the shorth and also the
LMS midpoint introduce a mode flavour to summary of location.  When
applied to distributions that are approximately symmetric, the shorth
will be close to the mean and median, but more resistant than the mean to
outliers in either tail and more efficient than the mean for
distributions near Gaussian (normal) in shape. When applied to
distributions that are unimodal and asymmetric, the shorth and the LMS
will typically be nearer the mode than either the mean or the median.
Note that the idea of estimating the mode as the midpoint of the shortest
interval that contains a fixed number of observations goes back at least
and Bickel and Frühwirth (2006) on other estimators of the mode. The
half-sample mode estimator of Bickel and Frühwirth is especially
interesting as a recursive selection of the shortest half. For a Stata
implementation and more detail, see hsmode from SSC.

5. Outlier identification  A resistant standardisation such as (value -
shorth) / length may help in identifying outliers. For discussions of
related ideas, see Carey et al.  (1997) and included references.

6. Generalise to shortest fraction  The idea can be generalised to
proportions other than one-half.

At the same time, note that

7. Not useful for all distributions  When applied to distributions that
are approximately J-shaped, the shorth will approximate the mean of the
lower half of the data and the LMS midpoint will be rather higher. When
applied to distributions that are approximately U-shaped, the shorth and
the LMS midpoint will be within whichever half of the distribution
happens to have higher average density. Neither behaviour seems
especially interesting or useful, but equally there is little call for
single mode-like summaries for J-shaped or U-shaped distributions; for J
shapes, the mode is, or should be, the minimum and for U shapes,
bimodality makes the idea of a single mode moot, if not invalid.

8. Interpretation under asymmetry  If applied knowingly to asymmetric
distributions, the query may be raised: What do you think you are
estimating? That is, the target for an estimator of location is not well
defined whenever there is no longer an unequivocal centre to a
distribution. This is a good question.  Three possible answers: I am not
estimating anything, but doing descriptive statistics.  I am estimating
the mode. What is being estimated should be defined in terms of the
estimator (compare Huber 1972).

9. Ties  The shortest half may not be uniquely defined. Even with
measured data, rounding of reported values may frequently give rise to
ties. What to do with two or more shortest halves has been little
discussed in the literature. Note that tied halves may either overlap or
be disjoint.

The procedure adopted in shorth given t ties is to report the
existence of ties and then to use the middlemost in order, except
that that is in turn not uniquely defined unless t is odd.  The
middlemost is arbitrarily taken to have position ceiling(t / 2) in
order, counting upwards. This is thus the 1st of 2, the 2nd of 3 or
4, and so forth.

This tie-break rule has some quirky consequences. Thus with values -9
-4 -1 0 -1 4 9, there is a tie for shortest half between -4 -1 0 1
and -1 0 1 4. The rule yields -1 as the shorth, not 0 as would be
natural on all other grounds. Otherwise put, this problem can arise
because for a window to be placed symmetrically around the order
statistics that define the median the window length 1 + floor(n / 2)
must be odd for odd n and even for even n, which is difficult to
achieve given other desiderata, notably that window length should
never decrease with sample size.

Apart from reporting that the shortest half is indeterminate, other
possibilities would be reporting the average of the union of tied
halves or the average of the averages of the tied halves for the
shorth, and similarly for the LMS midpoint. See for example Carey et
al.  (1997), who average the midpoints. One merit of the tie-break
rule here is that the shorth and LMS reported are always for a
predictable number of values, by default 1 + floor(n / 2).

10. Rationale for window length  Why half is taken to mean 1 + floor(n /
2) also does not appear to be discussed. Evidently we need a rule that
yields a window length for both odd and even n; it is preferable that the
rule be simple; and there is usually some slight arbitrariness in
choosing a rule of this kind. It is also important that any rule behave
reasonably for small n: even if a program is not deliberately invoked for
very small sample sizes the procedure used should make sense for all
possible sizes. Note that, with this rule, given n = 1 the shorth is just
the single sample value, and given n = 2 the shorth is the average of the
defines a slight majority, thus enforcing democratic decisions about the
data.  However, there seems no strong reason not to use ceiling(n / 2) as
an even simpler rule, except that all authors on the shorth appear to
have followed 1 + floor(n / 2).

11. Use with weighted data  Identification of the shortest half would
seem to extend only rather messily to situations in which observations
are associated with unequal weights and is thus not attempted here.

12. Length when most values identical  When at least half of the values
in a sample are equal to some constant, the length of the shortest half
is 0. So, for example, if most values are 0 and some are larger, the
length of the shortest half is not particularly useful as a measure of

Options

proportion(#) specifies a proportion other than 0.5 defining a shortest
fraction. That is the window length will be 1 + floor(proportion *
n). This is a rarely specified option.

allobs specifies use of the maximum possible number of observations for
each variable. The default is to use only those observations for
which all variables in varlist are not missing.

by() specifies a variable defining distinct groups for which statistics
should be calculated. by() is allowed only with a single varname. The
choice between by: and by() is partly one of precisely what kind of
output display is required. The display with by: is clearly
structured by groups while that with by() is more compact. To show
statistics for several variables and several groups with a single
call to shorth, the display with by: is essential.

missing specifies that with the by() option observations with missing
values of byvar should be included in calculations. The default is to
exclude them.

format(format) specifies a numeric format for displaying summary
statistics. The default is %8.2g.

name(#) specifies a maximum length for showing variable names (or in the
case of by() values or value labels) in the display of results. The
default is 32.

spaces(#) specifies the number of spaces to be shown between columns of
results. The default is 2.

ties requests a specification of which intervals tie for shortest half.
The ranks of the starting points k will be shown.

generate() specifies one or more new variables to hold calculated
results. generate() is allowed only with a single varname. This
option is most useful when you want to save statistics calculated for
several groups for further analysis. Note that generate() is not
allowed with the by: prefix: use the by() option instead.  Values for
the new variables will necessarily be identical for all observations
in each group: typically it will be useful to select just one
observation for each group, say by using egen, tag().

The specification consists of one or more space-separated elements
newvar=statistic, where newvar is a new variable name and statistic
is one of shorth, min, LMS or lms, max and length.

Examples

. shorth price-foreign

. bysort rep78: shorth mpg

. shorth mpg, by(rep78) generate(s=shorth LMS=LMS)

Saved results

(for last-named variable or group only)

r(N)         n
r(shorth)    shorth
r(min)       minimum in shortest half
r(rank_min)  rank of minimum
r(LMS)       LMS (midpoint of shortest half)
r(max)       maximum in shortest half
r(rank_max)  rank of maximum
r(length)    length of shortest half

Author

Nicholas J. Cox, Durham University, U.K.
n.j.cox@durham.ac.uk

References

Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H.  Rogers and
J.W. Tukey. 1972.  Robust estimates of location: survey and advances.
Princeton, NJ: Princeton University Press.

Bickel, D.R. 2002.  Robust estimators of the mode and skewness of
continuous data.  Computational Statistics & Data Analysis 39:
153-163.

Bickel, D.R. and R. Frühwirth. 2006.  On a fast, robust estimator of the
mode: comparisons to other estimators with applications.
Computational Statistics & Data Analysis 50: 3500-3530.

Carey, V.J., E.E. Walters, C.G. Wager and B.A. Rosner. 1997.  Resistant
and test-based outlier rejection: effects on Gaussian one- and
two-sample inference.  Technometrics 39: 320-330.

Christmann, A., U. Gather and G. Scholz. 1994.  Some properties of the
length of the shortest half.  Statistica Neerlandica 48: 209-213.

Dalenius, T. 1965.  The mode - A neglected statistical parameter.
Journal, Royal Statistical Society A 128: 110-117.

Grübel, R. 1988.  The length of the shorth.  Annals of Statistics 16:
619-628.

Hampel, F.R. 1975.  Beyond location parameters: robust concepts and
methods.  Bulletin, International Statistical Institute 46: 375-382.

Hampel, F.R. 1997.  Some additional notes on the "Princeton robustness
year".  In Brillinger, D.R., L.T. Fernholz and S. Morgenthaler (eds)
The practice of data analysis: essays in honor of John W. Tukey.
Princeton, NJ: Princeton University Press, 133-153.

Huber, P.J. 1972.  Robust statistics: a review.  Annals of Mathematical
Statistics 43: 1041-1067.

Kim, J. and D. Pollard. 1990.  Cube root asymptotics.  Annals of
Statistics 18: 191-219.

Maronna, R.A., R.D. Martin and V.J. Yohai. 2006.  Robust statistics:
theory and methods.  Chichester: John Wiley.

Martin, R.D. and R.H. Zamar. 1993.  Bias robust estimation of scale.
Annals of Statistics 21: 991-1017.

Robertson, T. and J.D. Cryer. 1974.  An iterative procedure for
estimating the mode.  Journal, American Statistical Association 69:
1012-1016.

Rousseeuw, P.J. 1984.  Least median of squares regression.  Journal,
American Statistical Association 79: 871-880.

Rousseeuw, P.J. and C. Croux. 1993.  Alternatives to the median absolute
deviation.  Journal, American Statistical Association 88: 1273-1283.

Rousseeuw, P.J. and A.M. Leroy. 1987.  Robust regression and outlier
detection.  New York: John Wiley.

Rousseeuw, P.J. and A.M. Leroy. 1988.  A robust scale estimator based on
the shortest half.  Statistica Neerlandica 42: 103-116.

Shorack, G.R. and J.A. Wellner. 1986.  Empirical processes with
applications to statistics.  New York: John Wiley.

Also see

Online:  egen, kdensity, means, hsmode (if installed), modes (if
installed)

```