Cluster Analysis
----------------
^cluster^ varlist [if exp] [in range], gen(varname) [ groups(#) iter(#) dist(#)
medoid std first equal display keep]
^Cluster^ performs nonhierarchical k-means (or k-medoids) cluster analysis of
your data.
Centroid cluster analysis is a simple method that groups cases based on their
proximity to a multidimensional centroid or medoid. The fitting method
proceeds in several steps:
1) Cases are placed (via one of three methods; see the ^first^ and ^equal^
options below) into one of the # groups you specified.
2) The mean within each group for each variable in ^varlist^ is calculated.
The vector consisting of all these means is the "centroid" of the group.
3) The distance of each case from each of the centroids is calculated.
4) Cases are reclassified into the group corresponding to the centroid
closest to their position.
5) Steps 2 through 4 are repeated until cluster assignments do not change
from one iteration to the next.
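The steps above can be sketched in Python. This is an illustration of the
algorithm only, not the ado-file's actual code; the init argument is a
stand-in for the three starting-assignment methods described below.

```python
import random

def kmeans(data, k, p=2, iters=50, init=None):
    """Naive k-means following the steps above: assign cases, compute
    centroids, measure distances, reassign, repeat until stable."""
    # step 1: starting assignment (random by default)
    assign = list(init) if init is not None else [random.randrange(k) for _ in data]
    cents = []
    for _ in range(iters):
        # step 2: centroid = per-variable mean within each group
        cents = []
        for g in range(k):
            members = [x for x, a in zip(data, assign) if a == g]
            if not members:                     # guard against an empty group
                members = [random.choice(data)]
            cents.append([sum(col) / len(members) for col in zip(*members)])
        # steps 3-4: Minkowski distance to each centroid, then reassign
        def dist(x, c):
            return sum(abs(xi - ci) ** p for xi, ci in zip(x, c)) ** (1 / p)
        new = [min(range(k), key=lambda g: dist(x, cents[g])) for x in data]
        if new == assign:                       # step 5: converged
            break
        assign = new
    return assign, cents
```

With two well-separated groups of points and a deterministic starting
assignment, the loop settles on the obvious partition in two passes.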
Number of clusters
------------------
For ^cluster^, the number of groups to be extracted must be specified with
the ^groups(#)^ option. The default number of clusters is two. If, for a
case, any of the variables in ^varlist^ is missing, the cluster for that
case is coded as missing.
Distance measures
-----------------
Specifying a real number with the ^dist(#)^ option tells ^cluster^ how to
calculate distances from centroids in multidimensional space. Distances are
calculated according to the Minkowski metric, viz.,
                             p  1/p
    Distance = { Sum |X - X'|  }          for p >= 1.
The default value is 2, that is, the "usual" Euclidean distance measure. An
alternative that is particularly reasonable for categorical data is p=1, the
city-block or absolute value metric.
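As a small illustration of the metric (not the ado-file's code), p=2 gives
the Euclidean distance and p=1 the city-block distance:

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two equal-length vectors, p >= 1.
    p=2 is the usual Euclidean distance; p=1 is the city-block metric."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)
```

For the points (0,0) and (3,4), p=2 gives 5.0 (the 3-4-5 triangle) while
p=1 gives 7.0 (3 + 4).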
Display option
--------------
If you specify ^display^, a two-dimensional plot of the first two variables
in ^varlist^, labeled by cluster number, is drawn at each iteration. This
allows you to watch the convergence process over iterations.
Medoid option
-------------
The default for ^cluster^ is multidimensional centroids, i.e., means. However,
a more robust method is to use multidimensional medoids, i.e., medians. If you
specify the ^medoid^ option, distances are calculated from medians rather
than means. All other aspects of operation are unchanged.
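The robustness difference is easy to see in a sketch (illustration only,
assuming per-variable medians, computed the same way as per-variable means):

```python
import statistics

def center(points, medoid=False):
    """Per-variable center of a group of vectors: means by default,
    medians when medoid=True (mimicking the medoid option)."""
    stat = statistics.median if medoid else statistics.mean
    return [stat(col) for col in zip(*points)]
```

A single outlier drags the mean far off while barely moving the median.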
Standardize option
------------------
If you specify the ^std^ (standardize) option, all variables in ^varlist^ are
standardized as z-scores: each value has the variable's mean subtracted and
is then divided by the variable's standard deviation. The default is no
standardization, but beware of this when the variables in ^varlist^ are
measured on different scales. See Kaufman and Rousseeuw (1990:9-11),
^Finding Groups in Data: An Introduction to Cluster Analysis^, for a
discussion of the pros and cons of standardization.
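The transformation itself is one line per variable. A minimal sketch,
assuming the sample standard deviation is used:

```python
import statistics

def zscores(values):
    """Standardize one variable as z-scores: subtract the mean,
    then divide by the (sample) standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]
```

After standardization every variable has mean 0 and standard deviation 1,
so no single variable dominates the distance calculation merely because of
its scale.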
First and equal options
-----------------------
The ^first^ and ^equal^ options control how ^cluster^ assigns cases to
clusters before its first reassignment pass. The default method is
to randomly assign cases to clusters with probability 1/k, where k is
the number of clusters. From these randomly constructed clusters, the
first centroids or medoids are calculated, and reassignment passes begin.
If you specify the ^first^ option, the first k cases with no missing values
will be taken as starting centroids or medoids. If you specify the ^equal^
option, cases will be assigned systematically to clusters, the first case to
the first cluster, second case to the second cluster, and so on until all
cases are assigned to the k clusters in approximately equal numbers. Because
solutions obtained with the ^first^ option depend on the sort order of the
data, the ^first^ option is not especially recommended.
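The two assignment-based starting methods can be sketched as follows; the
function name and method argument are illustrative, not part of ^cluster^'s
syntax:

```python
import random

def starting_assignment(n, k, method="random"):
    """Starting cluster assignment for n cases and k clusters.
    'random': each case lands in a cluster with probability 1/k.
    'equal' : systematic assignment, case i to cluster i mod k,
              giving approximately equal cluster sizes.
    (The first option instead seeds the centroids directly from
    the first k complete cases, so it needs no assignment step.)"""
    if method == "equal":
        return [i % k for i in range(n)]
    return [random.randrange(k) for _ in range(n)]
```

With 'equal', cluster sizes never differ by more than one case.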
Keep option
-----------
If you specify the ^keep^ option, the variables generated during the algorithm
will be kept with your data set. These variables include CENTij (or MEDij)
and DISTi, where "i" is an integer indicating centroid number "i" and "j" is
an integer indicating variable "j" from ^varlist^.
Output
------
The ^cluster^ program produces one piece of output: an iteration log is
printed as reassignment passes occur, and, upon convergence, summary
statistics for each cluster are presented.
Alternative output could include:
1) Graphing cluster membership against variables. A simple method would be
to use the command "graph var1 var2, s([cluster name])" where cluster name is
the name of the variable generated by cluster. If you have installed the ^gr3^
command (STB-2, gr3), you could plot in 3 dimensions.
2) Listing, for each case, its cluster membership, its distance, and its
centroid.
NOTE: A known bug in this program is that, occasionally, the total sum of
distances will increase after a reassignment pass. This should not happen.
This program is based on Anderberg, Michael (1973), ^Cluster Analysis for
Applications^, New York: Academic Press, chapter 7 (Forgy's method), page 161.
Author contact information is in the ado file.