Cluster Analysis
----------------

cluster ^varlist^ [if exp] [in range], gen(varname) [groups(#) iter(#) dist(#) display medoid std first equal keep]

^cluster^ performs nonhierarchical k-means (or k-medoids) cluster analysis of your data.

Centroid cluster analysis is a simple method that groups cases based on their proximity to a multidimensional centroid or medoid. The fitting method proceeds in several steps:

1) Cases are placed (via one of three methods) into one of the # groups you specified.
2) The mean within each group for each variable in ^varlist^ is calculated. The vector consisting of all these means is the "centroid" of the group.
3) The distance of each case from each of the centroids is calculated.
4) Cases are reclassified into the group corresponding to the centroid closest to their position.
5) Return to step 2 unless there is no change in cluster assignments from one iteration to the next.
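The five steps can be illustrated outside Stata with a short Python sketch. This is not the ado file's code: the function name `centroid_cluster`, its parameters, the 0-based group labels, the squared Euclidean distance, and the handling of empty groups are all assumptions made for illustration.

```python
import random

def centroid_cluster(rows, k=2, start=None, max_iter=100, seed=0):
    """Illustrative centroid clustering; returns one 0-based label per case."""
    rng = random.Random(seed)
    # Step 1: place each case into one of k groups (random start unless given).
    groups = list(start) if start is not None else [rng.randrange(k) for _ in rows]
    nvars = len(rows[0])
    for _ in range(max_iter):
        # Step 2: the centroid of a group is the vector of within-group means.
        centroids = []
        for g in range(k):
            members = [r for r, lab in zip(rows, groups) if lab == g]
            if members:
                centroids.append(
                    [sum(r[j] for r in members) / len(members) for j in range(nvars)]
                )
            else:
                centroids.append(None)  # assumption: an empty group is skipped
        live = [g for g in range(k) if centroids[g] is not None]
        # Steps 3 and 4: compute distances, then move each case to the closest
        # centroid. Squared Euclidean distance gives the same ordering as p=2.
        new_groups = [
            min(live, key=lambda g: sum((x - c) ** 2 for x, c in zip(r, centroids[g])))
            for r in rows
        ]
        # Step 5: stop when no case changes cluster.
        if new_groups == groups:
            break
        groups = new_groups
    return groups
```

For example, with six cases forming two well-separated clumps and an alternating start, `centroid_cluster(rows, k=2, start=[0, 1, 0, 1, 0, 1])` settles after one reassignment pass.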

Number of clusters
------------------

The number of groups to extract is set with the ^groups(#)^ option. If the option is omitted, ^cluster^ extracts two clusters. If any of the variables in ^varlist^ is missing for a case, the cluster for that case is coded as missing.

Distance measures
-----------------

Specifying a real number with the ^dist(#)^ option tells ^cluster^ how to calculate distances from centroids in multidimensional space. Distances are calculated according to the Minkowski metric, viz.,

                              p  1/p
     Distance = { Sum |X - X'|  }       for p>=1.

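Here X is the case's value and X' the centroid's value on a variable, and the sum runs over the variables in ^varlist^. The default value of p is 2, that is, the "usual" Euclidean distance measure. An alternative that is particularly reasonable for categorical data is p=1, the city-block or absolute value metric.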
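As a worked illustration of the metric, the following Python sketch computes the Minkowski distance between a case and a centroid. The function name `minkowski` and its argument names are hypothetical and do not come from the ado file.

```python
def minkowski(x, y, p=2.0):
    """Minkowski distance between two equal-length vectors, for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum the p-th powers of the absolute differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

With a case at (0, 0) and a centroid at (3, 4), p=2 gives the Euclidean distance 5 and p=1 gives the city-block distance 7.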

Display option
--------------

If you specify ^display^, a two-dimensional plot of the first two variables in ^varlist^, marked with the cluster number, is drawn at each iteration. This allows you to watch the convergence process over iterations.

Medoid option
-------------

The default for ^cluster^ is multidimensional centroids, i.e. means. However, a more robust method is to use multidimensional medoids, i.e. medians. If you choose the ^medoid^ option, distances will be calculated from medians rather than means. All other aspects of operation are unchanged.
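The difference between the two kinds of cluster center can be seen in a small Python sketch. It follows the description above (per-variable medians) rather than the ado file itself, and the name `group_center` and its `use_median` flag are hypothetical.

```python
from statistics import mean, median

def group_center(members, use_median=False):
    """Center of one group: per-variable means, or medians if use_median."""
    stat = median if use_median else mean
    # zip(*members) turns a list of cases into a list of variable columns.
    return [stat(col) for col in zip(*members)]
```

For a group whose first variable takes the values 1, 2, 3 and 100, the mean is 26.5 while the median is 2.5, which is why the median-based center is less sensitive to a single outlying case.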

Standardize option
------------------

If you specify the ^std^ (standardize) option, all variables in ^varlist^ are standardized as z-scores, i.e., each value has the variable's mean subtracted and is then divided by the variable's standard deviation. The default is no standardization, but beware of this when the variables in ^varlist^ are measured on different scales, since variables with larger ranges will then dominate the distances. See Kaufman and Rousseeuw (1990:9-11), ^Finding groups in data: An introduction to cluster analysis^, for some discussion of the pros and cons of standardization.
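A minimal Python sketch of z-score standardization is given here for illustration. The function name `standardize` is hypothetical, and the use of the sample standard deviation (divisor n-1) is an assumption about how the program computes it.

```python
from statistics import mean, stdev

def standardize(rows):
    """Return the cases with every variable rescaled to a z-score."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Assumption: sample standard deviation (n-1 divisor). A constant
    # variable has zero deviation and cannot be standardized this way.
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]
```

Two variables measured on very different scales, such as (1, 2, 3) and (100, 200, 300), both become (-1, 0, 1) after standardization and so contribute equally to the distances.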

First and equal options
-----------------------

The ^first^ and ^equal^ options control how ^cluster^ assigns cases to clusters before its first reassignment pass. The default method is to assign each case at random to one of the k clusters, each with probability 1/k, where k is the number of clusters. From these randomly constructed clusters, the first centroids or medoids are calculated, and reassignment passes begin. If you specify the ^first^ option, the first k cases with no missing values will be taken as starting centroids or medoids. If you specify the ^equal^ option, cases will be assigned systematically to clusters: the first case to the first cluster, the second case to the second cluster, and so on, cycling through the k clusters until all cases are assigned in approximately equal numbers. Because solutions obtained with the ^first^ option depend on the sort order of the data, the ^first^ option is not especially recommended.
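The three starting methods can be sketched in Python as follows. The names `initial_groups` and `first_k_centers`, the 0-based labels, and the use of `None` to mark a missing value are illustrative assumptions, not details of the ado file.

```python
import random

def initial_groups(n, k, method="random", seed=None):
    """Starting cluster labels (0-based) for n cases and k clusters."""
    if method == "random":
        # Default: each case lands in any cluster with probability 1/k.
        rng = random.Random(seed)
        return [rng.randrange(k) for _ in range(n)]
    if method == "equal":
        # Systematic: case 1 to cluster 1, case 2 to cluster 2, and so on,
        # cycling through the k clusters.
        return [i % k for i in range(n)]
    raise ValueError("method must be 'random' or 'equal'")

def first_k_centers(rows, k):
    """Starting centers for the 'first' method: first k complete cases."""
    complete = [r for r in rows if all(v is not None for v in r)]
    return [list(r) for r in complete[:k]]
```

Seven cases and three clusters under the systematic method receive the labels 0, 1, 2, 0, 1, 2, 0, so cluster sizes differ by at most one. The result of `first_k_centers` changes whenever the data are re-sorted, which is the dependence on sort order noted above.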

Keep option
-----------

If you specify the ^keep^ option, the variables generated during the algorithm will be kept with your data set. These variables include CENTij (or MEDij) and DISTi, where "i" is an integer indicating centroid number "i" and "j" is an integer indicating variable "j" from ^varlist^.

Output
------

The ^cluster^ program produces one kind of output: an iteration log is printed as reassignment passes occur, and, upon convergence, summary statistics of each cluster are presented.

Alternative output could include:

1) Graphing cluster membership against variables. A simple method would be to use the command "graph var1 var2, s([cluster name])" where cluster name is the name of the variable generated by ^cluster^. If you have installed the ^gr3^ command (STB-2, gr3), you could plot in 3 dimensions.
2) Listing the case, cluster membership, distance and centroid, for all cases.

NOTE: A known bug in this program is that the total sum of distances occasionally increases after a reassignment pass. This should not happen.

This program is based on Forgy's method as described in Anderberg, Michael R. (1973), Cluster Analysis for Applications, New York: Academic Press, Chapter 7, page 161.

Author contact information is in the ado file.