{smcl}
{* Copyright 2016 Brendan Halpin brendan.halpin@ul.ie }
{* Distribution is permitted under the terms of the GNU General Public Licence }
{* 16Jun2016}{...}
{cmd:help calinski}
{hline}

{title:Title}

{p2colset 5 20 22 2}{...}
{p2col:{hi:calinski} {hline 2}}Calinski-Harabasz cluster stopping index from distance matrix{p_end}
{p2colreset}{...}

{title:Syntax}

{p 8 17 2}
{cmd:calinski} , DISTmat(string) IDvar(varname) [NGroups(integer 15) NAME(clname) GRaph *]

{synoptset 22 tabbed}{...}
{synopthdr:options}
{synoptline}
{syntab:Required}
{synopt:{opt dist:mat(matname)}} names the distance matrix{p_end}
{synopt:{opt id:var(varname)}} identifies the variable that links the sort-order of the distance matrix to the sort-order of the data{p_end}
{syntab:Optional}
{synopt:{opt ng:roups}} The number of cluster solutions to test (default 15){p_end}
{synopt:{opt name}} Name of cluster analysis to use{p_end}
{synopt:{opt gr:aph}} plot the index against cluster size{p_end}
{synopt:{it:twoway_options}}options allowed with {helpb graph twoway}{p_end}
{synoptline}
{p2colreset}{...}

{title:Description}

{pstd}{cmd:calinski} calculates the Calinski-Harabasz pseudo-F for
stopping rules in cluster analysis, from the pairwise distance matrix.
This is widely used to determine the optimum number of clusters. Stata's
default {helpb cluster stop} does the same calculation on the basis of the
original variables, but cannot operate on the distance matrix.
{cmd:calinski} is thus useful when the original variables are not
available, or when the distances are created other than as squared
Euclidean distances between variables (as is the case for instance with
sequence analysis). {p_end}

{pstd} {bf:NB:} Stata's built-in {cmd:clustermat stop, variables(...)}
does {it:not} estimate the CH pseudo-F on the distance matrix used by
{cmd:clustermat}. Rather, it creates a new temporary distance matrix
based on the variables listed in the {cmd:variables()} option. {p_end}

{pstd}{cmd:calinski} depends on {help discrepancy} which
can be installed from SSC:{p_end}

{phang}{cmd:. ssc install discrepancy}

{pstd}Returns:{p_end}
      
{phang}r(calinski_#)       Calinski-Harabasz pseudo-F for # groups{p_end}


{title:Remarks}

{pstd}While {cmd:cluster stop} and {cmd:clustermat stop} estimate the
CH pseudo-F by cumulating the sum of squares from ANOVAs of the original
variables on the cluster solution, and are therefore explicitly rooted in
a squared-Euclidean distance point of view, {cmd:calisnki} takes the
distances as they are found. If they are squared distances based on the
original variables, the results will be identical to {cmd:cluster stop}.
If they are squared Euclidean distances from another source, the
interpretation will be the same. If they are other sorts of differences
(e.g., non-Euclidean) the interpretation is not necessarily the same,
but can be understood to be analogous, in the same way as the
{cmd:discrepancy} partitioning of the distance matrix (described by
Studer et al 2011) is analogous to ANOVA.{p_end}

{pstd}Because the order of the data and the order of the distance matrix
must coincide, the dataset must be sorted by {opt id:var}. It is the
user's responsibility that this variable defines the correct order.{p_end}

{title:References}

{p 4 4 2} Milligan, G. W., and M. C. Cooper. 1985.  An examination of procedures for determining the number of clusters in a dataset. {it:Psychometrika} 50: 159-179. {p_end}
{p 4 4 2} M Studer, G Ritschard, A Gabadinho and NS Müller, Discrepancy analysis of state sequences, {it:Sociological Methods and Research}, 40(3):471-510 {p_end}


{title:Author}

{phang}Brendan Halpin, brendan.halpin@ul.ie{p_end}


{title:Examples}

{phang}{cmd:. calinski, dist(distances) id(id) graph}{p_end}

{title:See Also}

{phang}{help cluster stop}{p_end}
{phang}{help dudahart}{p_end}
{phang}{help discrepancy}{p_end}
{phang}{help SADI}{p_end}