{smcl}
{* 26feb2007}{...}
{hline}
help for {hi:hsmode}
{hline}
{title:Half-sample modes}
{p 8 17 2}{cmd:hsmode}
{it:varlist}
[{cmd:if} {it:exp}]
[{cmd:in} {it:range}]
[{cmd:,}
{cmdab:all:obs}
{cmdab:f:ormat(}{it:format}{cmd:)}
{cmdab:n:ame(}{it:#}{cmd:)}
{cmdab:s:paces(}{it:#}{cmd:)}
]
{p 8 17 2}{cmd:hsmode}
{it:varname}
[{cmd:if} {it:exp}]
[{cmd:in} {it:range}]
[{cmd:,}
{cmdab:all:obs}
{cmd:by(}{it:byvar}{cmd:)}
{cmdab:miss:ing}
{cmdab:f:ormat(}{it:format}{cmd:)}
{cmdab:n:ame(}{it:#}{cmd:)}
{cmdab:s:paces(}{it:#}{cmd:)}
{cmdab:g:enerate(}{it:newvar}{cmd:)}
]
{p 4 4 2}{cmd:by ... :} may also be used with {cmd:hsmode}: see help on
{help by}.
{title:Description}
{p 4 4 2}{cmd:hsmode} calculates half-sample modes for {it:varlist}
based on recursive selection of the half-sample with the shortest length.
Although it has longer roots, this implementation of half-sample modes
is based particularly on the ideas of Bickel and Fr{c u:}hwirth (2006).
{title:Remarks}
{p 4 4 2}The idea of estimating the mode as the midpoint of the shortest
interval that contains a fixed number of observations goes back at least
to Dalenius (1965). See also Robertson and Cryer (1974), Bickel (2002)
and Bickel and Fr{c u:}hwirth (2006) on other estimators of the mode.
{p 4 4 2}The order statistics of a sample of n values of x are defined
by
x(1) <= x(2) <= ... <= x(n-1) <= x(n).
{p 4 4 2}The half-sample mode is here defined using two rules.
{p 4 4 2}Rule 1.
If n = 1, the half-sample mode is x(1).
If n = 2, the half-sample mode is (x(1) + x(2)) / 2.
If n = 3, the half-sample mode is (x(1) + x(2)) / 2
if x(1) and x(2) are closer than x(2) and x(3),
(x(2) + x(3)) / 2 if the opposite is true, and
x(2) otherwise.
{p 4 4 2}Rule 2. If n >= 4, we apply recursive selection until left with
3 or fewer values. First let h_1 = {help floor():floor}(n / 2). The
shortest half of the data from rank k to rank {bind:k + h_1} is
identified to minimise
x(k + h_1) - x(k)
{p 4 4 2}over k = 1, ..., n - h_1. Then the shortest half of
those h_1 + 1 values is identified using h_2 = floor(h_1 / 2), and so
on. To finish, use Rule 1.
{p 4 4 2}The idea of identifying the shortest half is applied in the
"shorth" named by J.W. Tukey and introduced in the Princeton robustness
study of estimators of location by Andrews, Bickel, Hampel, Huber,
Rogers and Tukey (1972, p.26) as the mean of the shortest half-length
x(k), ..., {bind:x(k + h)} for h = floor(n / 2).
Rousseeuw (1984), building on a suggestion by Hampel (1975), pointed out
that the midpoint of the shortest half {bind:(x(k) + x(k + h)) / 2} is
the least median of squares (LMS) estimator of location for x. See
Rousseeuw (1984) and Rousseeuw and Leroy (1987) for applications of LMS
and related ideas to regression and other problems. Note that this LMS
midpoint is also called the shorth in some recent literature (e.g.
Maronna, Martin and Yohai 2006, p.48). Further, the shortest half itself
is also sometimes called the shorth, as the title of Gr{c u:}bel (1988)
indicates. For a Stata implementation and more detail, see {help shorth}
from SSC.
{p 4 4 2}Some broad-brush comments follow on advantages and
disadvantages of half-sample modes, from the standpoint of practical
data analysts as much as mathematical or theoretical statisticians.
Whatever the project, it will always be wise to compare {cmd:hsmode}
results with standard summary measures (e.g. medians or means, including
geometric and harmonic means) and to relate results to graphs of
distributions. Moreover, if your interest is in the existence or extent
of bimodality or multimodality, it will be best to look directly at
suitably smoothed estimates of the density function.
{p 4 4 2}1. {it:Mode estimation}{space 1} By summarizing where the data
are densest, the half-sample mode adds an automated estimator of the
mode to the toolbox. More traditional estimates of the mode based on
identifying peaks on histograms or even kernel density plots are
sensitive to decisions about bin origin or width or kernel type and
kernel half-width and more difficult to automate in any case. When
applied to distributions that are unimodal and approximately symmetric, the
half-sample mode will be close to the mean and median, but more
resistant than the mean to outliers in either tail. When applied to
distributions that are unimodal and asymmetric, the half-sample mode
will typically be much nearer the mode identified by other methods than
either the mean or the median.
{p 4 4 2}2. {it:Simplicity}{space 1} The idea of the half-sample mode is
fairly simple and easy to explain to students and researchers who do not
regard themselves as statistical specialists.
{p 4 4 2}3. {it:Graphic interpretation}{space 1} The half-sample mode can
easily be related to standard displays of distributions such as kernel density
plots, cumulative distribution and quantile plots, histograms and stem-and-leaf
plots.
{p 4 4 2}At the same time, note that
{p 4 4 2}4. {it:Not useful for all distributions}{space 1} When applied
to distributions that are approximately J-shaped, the half-sample mode
will approximate the minimum of the data. When applied to distributions
that are approximately U-shaped, the half-sample mode will be within
whichever half of the distribution happens to have higher average
density. Neither behaviour seems especially interesting or useful, but
equally there is little call for single mode-like summaries for J-shaped
or U-shaped distributions. For U shapes, bimodality makes the idea of a
single mode moot, if not invalid.
{p 4 4 2}5. {it:Ties}{space 1} The shortest half may not be uniquely
defined. Even with measured data, rounding of reported values may
frequently give rise to ties. What to do with two or more shortest
halves has been little discussed in the literature. Note that tied
halves may either overlap or be disjoint.
{p 8 8 2}The procedure adopted in {cmd:hsmode} given t ties is to use
the middlemost in order, except that that is in turn not uniquely
defined unless t is odd. The middlemost is arbitrarily taken to have
position {help ceil():ceiling}(t / 2) in order, counting upwards. This
is thus the 1st of 2, the 2nd of 3 or 4, and so forth.
{p 8 8 2}This tie-break rule has some quirky consequences. Thus with
values -9 -4 -1 0 -1 4 9, the rules yield -0.5 as the half-sample mode,
not 0 as would be natural on all other grounds. Otherwise put, this
problem can arise because for a window to be placed symmetrically the
window length 1 + floor(n / 2) must be odd for odd n and even for even
n, which is difficult to achieve given other desiderata, notably that
window length should never decrease with sample size. We prefer to believe
that this is a minor problem with datasets of reasonable size.
{p 4 4 2}6. {it:Rationale for window length}{space 1} Why half is taken
to mean 1 + floor(n / 2) also does not appear to be discussed. Evidently
we need a rule that yields a window length for both odd and even n; it
is preferable that the rule be simple; and there is usually some slight
arbitrariness in choosing a rule of this kind. It is also important that
any rule behave reasonably for small n: even if a program is not
deliberately invoked for very small sample sizes the procedure used
should make sense for all possible sizes. Note that, given n = 1, the
half-sample mode is just the single sample value, and, given n = 2, it is
the average of the two sample values. A further detail about this rule
is that it always defines a slight majority, thus enforcing democratic
decisions about the data. However, there seems no strong reason not to
use ceiling(n / 2) as an even simpler rule, except that if it makes much
difference, then it is likely that your sample size or variable is
unsuitable for the purpose.
{p 4 4 2}7. {it:Use with weighted data}{space 1} Identification of the
half-sample mode for values associated with unequal weights is not
supported at this time.
{title:Options}
{p 4 8 2}{cmd:allobs} specifies use of the maximum possible number of
observations for each variable. The default is to use only those
observations for which all variables in {it:varlist} are not missing.
{p 4 8 2}{cmd:by()} specifies a variable defining distinct groups for
which statistics should be calculated. {cmd:by()} is allowed only with a
single {it:varname}. The choice between {cmd:by:} and {cmd:by()} is
partly one of precisely what kind of output display is required. The
display with {cmd:by:} is clearly structured by groups while that with
{cmd:by()} is more compact. To show statistics for several variables and
several groups with a single call to {cmd:hsmode}, the display with
{cmd:by:} is essential.
{p 4 8 2}{cmdab:miss:ing} specifies that with the {cmd:by()} option
observations with missing values of {it:byvar} should be included in
calculations. The default is to exclude them.
{p 4 8 2}{cmdab:f:ormat(}{it:format}{cmd:)} specifies a numeric format
for displaying results. The default is %8.2g.
{p 4 8 2}{cmdab:n:ame(}{it:#}{cmd:)} specifies a maximum length for
showing variable names (or in the case of {cmd:by()} values or value
labels) in the display of results. The default is 32.
{p 4 8 2}{cmdab:s:paces(}{it:#}{cmd:)} specifies the number of spaces to
be shown between columns of results. The default is 2.
{p 4 8 2}{cmd:generate()} specifies a new variable to hold
calculated modes. {cmd:generate()} is allowed only with a single
{it:varname}. This option is most useful when you want to save
modes calculated for several groups for further analysis. Note that
{cmd:generate()} is not allowed with the {cmd:by:} prefix: use the
{cmd:by()} option instead. Values for the new variable will
necessarily be identical for all observations in each group: typically
it will be useful to select just one observation for each group, say by
using {help egen:egen, tag()}.
{title:Examples}
{p 4 4 2}Robertson and Cryer (1974, p.1014) reported 35 measurements of
uric acid (in mg/100 ml):
1.6, 3.11, 3.95, 4.2, 4.2, 4.62, 4.62, 4.62, 4.7, 4.87, 5.04, 5.29, 5.3,
5.38, 5.38, 5.38, 5.54, 5.54, 5.63, 5.71, 6.13, 6.38, 6.38, 6.67, 6.69,
6.97, 7.22, 7.72, 7.98, 7.98, 8.74, 8.99, 9.27, 9.74, 10.66.
{cmd:hsmode} reports a mode of 5.38. Robertson and Cryer's own estimates
using a rather different procedure are 5.00, 5.02, 5.04.
{help kdensity}'s default supports {cmd:hsmode} here.
{p 4 8 2}{cmd:. hsmode price-foreign}
{p 4 8 2}{cmd:. bysort rep78: hsmode mpg}
{p 4 8 2}{cmd:. hsmode mpg, by(rep78) generate(hsmode)}
{title:Saved results}
{p 4 4 2}(for last-named variable or group only)
r(N) n
r(hsmode) half-sample mode
{title:Author}
{p 4 4 2}Nicholas J. Cox, Durham University, U.K.{break}
n.j.cox@durham.ac.uk
{title:References}
{p 4 8 2}Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H.
Rogers and J.W. Tukey. 1972.
{it:Robust estimates of location: survey and advances.}
Princeton, NJ: Princeton University Press.
{p 4 8 2}Bickel, D.R. 2002.
Robust estimators of the mode and skewness of continuous data.
{it:Computational Statistics & Data Analysis} 39: 153{c -}163.
{p 4 8 2}Bickel, D.R. and R. Fr{c u:}hwirth. 2006.
On a fast, robust estimator of the mode: comparisons to other estimators
with applications.
{it:Computational Statistics & Data Analysis} 50: 3500{c -}3530.
{p 4 8 2}Dalenius, T. 1965.
The mode {c -} A neglected statistical parameter.
{it:Journal, Royal Statistical Society} A 128: 110{c -}117.
{p 4 8 2}Gr{c u:}bel, R. 1988.
The length of the shorth.
{it:Annals of Statistics} 16: 619{c -}628.
{p 4 8 2}Hampel, F.R. 1975.
Beyond location parameters: robust concepts and methods.
{it:Bulletin, International Statistical Institute} 46: 375{c -}382.
{p 4 8 2}Maronna, R.A., R.D. Martin and V.J. Yohai. 2006.
{it:Robust statistics: theory and methods.}
Chichester: John Wiley.
{p 4 8 2}Robertson, T. and J.D. Cryer. 1974.
An iterative procedure for estimating the mode.
{it:Journal, American Statistical Association} 69: 1012{c -}1016.
{p 4 8 2}Rousseeuw, P.J. 1984.
Least median of squares regression.
{it:Journal, American Statistical Association} 79: 871{c -}880.
{p 4 8 2}Rousseeuw, P.J. and A.M. Leroy. 1987.
{it:Robust regression and outlier detection.}
New York: John Wiley.
{title:Also see}
{p 4 13 2}
Online: {help egen},
{help kdensity},
{help means},
{help modes} (if installed),
{help shorth} (if installed)