Title
clstop_lbt -- Steinley & Brusco's lower bound technique (LBT) to determine the number of k-means clusters
Syntax
cluster stop [clname], rule(lbt)
Description
clstop_lbt adds the rule lbt to the post-estimation command cluster stop to determine the number of k-means clusters using Steinley & Brusco's (2011) lower bound technique (LBT).
clstop_lbt creates the normalized index LBT, which measures how close the observed within-cluster sum of squares (SSE) is to its minimum value, SSE(min), relative to the total sum of squares (SST): LBT = (SSE - SSE(min))/SST. The method for determining the lower bound of the SSE is given in Steinley & Brusco (2011, p. 289). If the number of variables is less than or equal to the number of clusters k, LBT equals the ratio SSE/SST, and the LBT cannot be used. Using the LBT, the partition into k clusters is chosen for which LBT(k) is smallest.
clstop_lbt can also be used to determine whether there is more than one cluster. In this case, the ratio SSE(2)/SST of a two-cluster solution should be less than the lower bound ratio (LBR) obtainable when there is only one cluster: assuming a (multivariate) normal distribution, LBR(normal) is 1-2/pi = .3634; assuming a uniform distribution, LBR(uniform) is .25.
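The one-versus-several-clusters check described above can be sketched as follows. This is a minimal illustration, not part of the package: it assumes clstop_lbt is installed, that the ratio for the two-cluster solution is returned as r(SSE_SST_2), and that the normal-distribution bound is the appropriate reference. The cluster name km2 and the scalar and local names are hypothetical.

```stata
* Sketch: compare SSE(2)/SST with the one-cluster lower bound ratio.
* Assumes clstop_lbt is installed; km2, ratio2, and lbr_normal are illustrative names.
webuse iris, clear
cluster kmeans seplen-petwid, k(2) start(prandom(1)) name(km2)
cluster stop km2, rule(lbt)

* Copy the returned ratio before any other command can overwrite r()
scalar ratio2 = r(SSE_SST_2)

* LBR under a (multivariate) normal distribution: 1 - 2/pi = .3634
local lbr_normal = 1 - 2/_pi

if ratio2 < `lbr_normal' {
    display "SSE(2)/SST = " %6.4f ratio2 " is below LBR(normal): more than one cluster is suggested"
}
else {
    display "SSE(2)/SST = " %6.4f ratio2 " is not below LBR(normal): no evidence for more than one cluster"
}
```

Replacing `lbr_normal' with .25 applies the stricter bound quoted above.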
A simulation study by Steinley & Brusco (2011) shows that the LBT index outperforms the CH (Calinski-Harabasz) index in accuracy and precision. However, the LBT requires that the number of variables exceed the number of clusters. When the number of clusters equals or exceeds the number of variables, Steinley & Brusco recommend using the CH index, which is also calculated by clstop_lbt (see Saved Results) and which is the default rule of cluster stop.
Example
    . webuse iris
    . cluster kmeans seplen-petwid, k(2) s(pr(1))
    . cluster stop, rule(lbt)
    . cluster kmeans seplen-petwid, k(3) s(pr(1))
    . cluster stop, rule(lbt)
    . cluster kmeans seplen-petwid, k(4) s(pr(1))
    . cluster stop, rule(lbt)
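To compare LBT(k) across candidate solutions without retyping the commands, the example can be wrapped in a loop. The sketch below assumes clstop_lbt is installed and that the index is returned as r(LBT_#), with # equal to the number of clusters of the analysis being evaluated. The cluster names km2 and km3 are hypothetical, and start(prandom(1)) is the unabbreviated form of s(pr(1)). The loop stops at k = 3 because the iris example uses four variables and the LBT requires more variables than clusters.

```stata
* Sketch: list LBT(k) for k = 2, 3 so that the smallest value can be identified.
* Assumes clstop_lbt is installed; the names km2 and km3 are illustrative.
webuse iris, clear
forvalues k = 2/3 {
    cluster kmeans seplen-petwid, k(`k') start(prandom(1)) name(km`k')
    cluster stop km`k', rule(lbt)
    display "k = `k'   LBT = " %6.4f r(LBT_`k')
}
```

The partition with the smallest displayed LBT is the one the rule selects.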
Saved Results
cluster stop with rule(lbt) saves the following in r():

Scalars
    r(N)            number of valid cases (listwise)
    r(k)            number of partitions (clusters)
    r(SSE_#)        within-clusters (error) sum of squares for # partitions
    r(SSB_#)        between-clusters sum of squares for # partitions
    r(SSE_SST_#)    ratio SSE/SST for # partitions
    r(calinski_#)   Calinski & Harabasz pseudo-F for # partitions
    r(LBT_#)        index LBT for # partitions

Macros
    r(clname)       name of the cluster analysis
    r(vars)         list of variables used
    r(rule)         lbt
References
Steinley, D. & Brusco, M. J. (2011). Choosing the number of clusters in K-means clustering. Psychological Methods, 16, 285-297.
Also see
Manual: [MV] cluster programming subroutines
Author
Dirk Enzmann
Institute of Criminal Sciences, Hamburg
email: dirk.enzmann@uni-hamburg.de