```-------------------------------------------------------------------------------
help for group1d
-------------------------------------------------------------------------------

Grouping or clustering in one dimension

group1d varname [if exp] [in range] , max(#) [
generate(specification) ]

Description

group1d groups or clusters values of varname, taken to be ordered in one
dimension.  Natural examples are the values of a variable sorted from
smallest to largest, the values of a time series, or the values of a
spatial series along or down a profile, transect, core, borehole, etc. n
values are clustered into one or more contiguous groups, the (k - 1)
boundaries between k groups being chosen to minimise the sum of the
within-cluster sums of squared deviations from cluster means over the
comb(n - 1, k - 1) possible clusterings. The clustering produced is
guaranteed optimal, but it may not be unique.

Remarks

group1d ignores missing values of varname and any observations not
satisfying any if or in conditions specified. Otherwise it takes the
current sort order of observations to convey the ordering desired.  For
easier interpretation of results, drop any unwanted observations or sort
them to the end of your dataset.

Cluster analysis with a single variable makes perfect sense whenever
there is some dimension along which values can be arranged.  This could
be a measurement scale, time or space.  Given ordered data, there might
be interest in looking for relative breaks within a frequency
distribution (antimodes, in one terminology). A time series could be
divided into spells, epochs, periods, whatever, ideally with relatively
small differences within subseries and relatively large differences
between subseries.  The same problem arises whenever a single spatial
dimension (horizontal or vertical) is to be subdivided.  In geological
and other sciences, this is often studied under the heading of zonation.

Note that any formal clustering should always be accompanied by
appropriate plotting of the data (for example, using a dot or quantile or
line plot), which indeed may make clear either that breaks are obvious
(so that formal clustering is merely decorative) or that convincing
breaks do not exist (so that formal clustering may be pointless).

Consider a toy example of values ordered by magnitude:

14 15 16 23 24 25 56 57 58

where it is evident that a three-group clustering

14 15 16 | 23 24 25 | 56 57 58

is sensible. Whether the ordering is on the values themselves, or in time
or in space, the data can always be laid out in one dimension, which
gives special structure to the problem. Thus, although more general
clustering methods can be used, that special structure ideally should be
exploited. k groups devised for n values are defined by placing (k - 1)
markers (in the example above, k - 1 = 2); there are (n - 1) possible
places to put them. There are thus comb(n - 1, k - 1) possible
clusterings. However, if k is free to vary, then the total number of
possible clusterings is 2^(n - 1), as each value can be in the same group
as each neighbour, or not.  For even modest n, that is a large number.

The problem can be made precise (Fisher 1958; Hartigan 1975) by placing
markers to minimise, for a given number of groups, the

sum over groups of variability around group centres.

A sum of squared deviations from group means will spring to mind as the
most obvious possibility, and this is implemented in group1d.  Sum of
absolute deviations from group medians, and other measures, might well be
entertained.

Hartigan (1975) shows how a dynamic programming approach makes such
computation straightforward. group1d owes its origin to Hartigan's
original Fortran.

Vignettes

Walter Dummer Fisher (1916-1995) was an American economist and
econometrician. He gained a first degree in history from Harvard in 1937
and a doctorate from Chicago in economics on the demand for lemons in
1943. Fisher worked in the Department of Agriculture, the United States
Air Force, the University of California, Kansas State University, the
National Bureau of Economic Research and Northwestern University.

John Anthony Hartigan (1937- ) is an Australian statistician. He was born
in Sydney, gained degrees in mathematics from the University of Sydney
and in statistics from Princeton, and has worked at Princeton, Cambridge
and (since 1969) Yale. Hartigan's interests include classification and
clustering, Bayesian statistics and statistical computing and graphics.
He has been credited with the invention of the scatter plot matrix.

Options

max() specifies the maximum number of clusters to determine.  Optimal
clusterings for fewer clusters will automatically be shown.  It is
usual to set max() to rather larger than you desire and inspect
results for indications of a good number of clusters. This option is
required. Note that the amount of calculation by the program
increases with the maximum specified.

generate() specifies that one or more new variables are to be produced
specifying cluster membership. The specification rule is best
explained by example. generate(g3=3 g4=4) specifies that new variable
g3 should be generated showing the membership of the 3-group
clustering and new variable g4 should be generated showing the
membership of the 4-group clustering. The values of the new variables
will be 1, 2, 3 and 1, 2, 3, 4 respectively indicating membership of
groups 1, 2, 3 or 4. The specification for each new variable should
not include spaces and the specifications for two or more variables
should be separated by spaces. This option may be used in a second
application of group1d after a first has indicated an interesting or
useful grouping.

Examples

. group1d myvar, max(4)
. group1d myvar, max(3) gen(g3=3)

Author

Nicholas J. Cox, Durham University
n.j.cox@durham.ac.uk

References

Barry, D. 2005.  A conversation with John Hartigan.  Statistical Science
20: 418-430.

Fisher, W.D. 1958. On grouping for maximum homogeneity.  Journal,
American Statistical Association 53: 789-98 [will be accessible to
some via www.jstor.org]

Hartigan, J.A. 1975.  Clustering algorithms.  New York: John Wiley. Ch.6.

```