Grouping or clustering in one dimension
group1d varname [if exp] [in range] , max(#) [ generate(specification) ]
Description
group1d groups or clusters values of varname, taken to be ordered in one dimension. Natural examples are the values of a variable sorted from smallest to largest, the values of a time series, or the values of a spatial series along or down a profile, transect, core, borehole, etc. n values are clustered into one or more contiguous groups, the (k - 1) boundaries between k groups being chosen to minimise the sum of the within-cluster sums of squared deviations from cluster means over the comb(n - 1, k - 1) possible clusterings. The clustering produced is guaranteed optimal, but it may not be unique.
Remarks
group1d ignores missing values of varname and any observations not satisfying any if or in conditions specified. Otherwise it takes the current sort order of observations to convey the ordering desired. For easier interpretation of results, drop any unwanted observations or sort them to the end of your dataset.
Cluster analysis with a single variable makes perfect sense whenever there is some dimension along which values can be arranged. This could be a measurement scale, time or space. Given ordered data, there might be interest in looking for relative breaks within a frequency distribution (antimodes, in one terminology). A time series could be divided into spells, epochs, periods, whatever, ideally with relatively small differences within subseries and relatively large differences between subseries. The same problem arises whenever a single spatial dimension (horizontal or vertical) is to be subdivided. In geological and other sciences, this is often studied under the heading of zonation.
Note that any formal clustering should always be accompanied by appropriate plotting of the data (for example, using a dot or quantile or line plot), which indeed may make clear either that breaks are obvious (so that formal clustering is merely decorative) or that convincing breaks do not exist (so that formal clustering may be pointless).
Consider a toy example of values ordered by magnitude:
14 15 16 23 24 25 56 57 58
where it is evident that a three-group clustering
14 15 16 | 23 24 25 | 56 57 58
is sensible. Whether the ordering is on the values themselves, or in time or in space, the data can always be laid out in one dimension, which gives special structure to the problem. Thus, although more general clustering methods can be used, that special structure ideally should be exploited. k groups devised for n values are defined by placing (k - 1) markers (in the example above, k - 1 = 2); there are (n - 1) possible places to put them. There are thus comb(n - 1, k - 1) possible clusterings. However, if k is free to vary, then the total number of possible clusterings is 2^(n - 1), as each value can be in the same group as each neighbour, or not. For even modest n, that is a large number.
The problem can be made precise (Fisher 1958; Hartigan 1975) by placing markers to minimise, for a given number of groups, the
sum over groups of variability around group centres.
A sum of squared deviations from group means will spring to mind as the most obvious possibility, and this is implemented in group1d. Sum of absolute deviations from group medians, and other measures, might well be entertained.
Hartigan (1975) shows how a dynamic programming approach makes such computation straightforward. group1d owes its origin to Hartigan's original Fortran.
Vignettes
Walter Dummer Fisher (1916-1995) was an American economist and econometrician. He gained a first degree in history from Harvard in 1937 and a doctorate from Chicago in economics on the demand for lemons in 1943. Fisher worked in the Department of Agriculture, the United States Air Force, the University of California, Kansas State University, the National Bureau of Economic Research and Northwestern University.
John Anthony Hartigan (1937- ) is an Australian statistician. He was born in Sydney, gained degrees in mathematics from the University of Sydney and in statistics from Princeton, and has worked at Princeton, Cambridge and (since 1969) Yale. Hartigan's interests include classification and clustering, Bayesian statistics and statistical computing and graphics. He has been credited with the invention of the scatter plot matrix.
Options
max() specifies the maximum number of clusters to determine. Optimal clusterings for fewer clusters will automatically be shown. It is usual to set max() to rather larger than you desire and inspect results for indications of a good number of clusters. This option is required. Note that the amount of calculation by the program increases with the maximum specified.
generate() specifies that one or more new variables are to be produced specifying cluster membership. The specification rule is best explained by example. generate(g3=3 g4=4) specifies that new variable g3 should be generated showing the membership of the 3-group clustering and new variable g4 should be generated showing the membership of the 4-group clustering. The values of the new variables will be 1, 2, 3 and 1, 2, 3, 4 respectively indicating membership of groups 1, 2, 3 or 4. The specification for each new variable should not include spaces and the specifications for two or more variables should be separated by spaces. This option may be used in a second application of group1d after a first has indicated an interesting or useful grouping.
Examples
. group1d myvar, max(4) . group1d myvar, max(3) gen(g3=3)
Author
Nicholas J. Cox, Durham University n.j.cox@durham.ac.uk
References
Barry, D. 2005. A conversation with John Hartigan. Statistical Science 20: 418-430.
Fisher, W.D. 1958. On grouping for maximum homogeneity. Journal, American Statistical Association 53: 789-98 [will be accessible to some via www.jstor.org]
Hartigan, J.A. 1975. Clustering algorithms. New York: John Wiley. Ch.6.