-------------------------------------------------------------------------------
help for mahascore2
-------------------------------------------------------------------------------

Generate a Mahalanobis distance measure between two "points", either explicitly specified or as the means of specified populations

mahascore2 varlist [weight] , [ point1(point1mat) point2(point2mat) pop1(pop1var) pop2(pop2var) covarpop(covarpopvar) invcovarmat(invcovarmat) compute_invcovarmat euclidean union display(display_options) ]

Description

mahascore2 generates a squared Mahalanobis distance measure between two "points" in "data space", with data space defined by varlist.

In contrast to other programs of the mahapick suite (mahascore, mahascores, mahapick), which generate multitudes of values, this generates a single value: the squared distance between the two points. The value is returned in r(mahascore_sq).
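As a minimal sketch of picking up the returned value, the following assumes that indicator variables treated and not_treated already exist (as in the Examples section) and uses only the options documented here. The unsquared distance is obtained by taking the square root yourself.

```stata
* Assumes income, age, numkids, treated and not_treated are in memory.
mahascore2 income age numkids, pop1(treated) pop2(not_treated) compute_invcovarmat

* The squared distance is left behind in r(mahascore_sq).
display "squared distance:   " r(mahascore_sq)

* sqrt() yields missing if the squared value is missing or negative.
display "unsquared distance: " sqrt(r(mahascore_sq))
```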

varlist (the "covariates") is a list of numeric variables on which to build the distance measure. These variables should be of numeric significance, not categorical; any categorical variables should be replaced by a set of indicator variables.

Weights are allowed; they affect only the means computed under the pop1 and pop2 options, and the computation of the inverse covariance matrix under the compute_invcovarmat option.

Each point can be specified as either...
 - a tuple of values stored in a matrix, using the point1 and/or point2 options;
 - the means of the variables of varlist, restricted to a specified subset of the observations, using the pop1 and/or pop2 options.
Note that either point may be specified by either method.

The result is actually the square of the Mahalanobis distance measure; both the squared and unsquared values are reported, but only the squared value is returned. Note that in most usages, the resulting values are used in comparisons or sortings; the proportional magnitude is not significant, so the squared value is just as good.

Options

In what follows, let p denote the number of variables in varlist.

point1(point1mat) specifies the first point in explicit terms. point1mat is the name of a matrix bearing a tuple of values; it must be a column vector (a p-by-1 matrix) whose entries correspond to the variables in varlist, and whose rownames equal the names in varlist in the same order. An example of how to do this is given below.

pop1(pop1var) specifies the first point implicitly as the tuple of means of varlist, limited to the set of observations for which pop1var is nonzero. This is also known as the centroid of varlist, limited to the population indicated by pop1var. Note that the means are computed subject to weighting.

It is required to specify point1(point1mat) or pop1(pop1var). If both are present, then pop1(pop1var) takes precedence.

point2(point2mat) specifies the second point in explicit terms, in the same manner as point1.

pop2(pop2var) specifies the second point implicitly as the tuple of means of varlist, limited to the set of observations for which pop2var is nonzero. Note that the means are computed subject to weighting.

It is required to specify point2(point2mat) or pop2(pop2var). If both are present, then pop2(pop2var) takes precedence.

invcovarmat(invcovarmat) specifies the name of a matrix to be used in the computation described under Remarks. It is presumably the inverse covariance matrix of varlist (possibly for some subset of the observations), but the only requirement is that it be a square p-by-p matrix, and both the row and column names must equal the names in varlist in the same order as in varlist.

You can use covariancemat to help construct the inverse covariance matrix; it should be followed by a mat ... = inv() operation. An example is given below, in the Examples section. See further discussion of the purpose of this option, under Remarks.

compute_invcovarmat specifies that you want the inverse covariance matrix to be computed, rather than passed in (via invcovarmat()). This computation is subject to weighting. Note that this will call covariancemat, which computes covariances limited to observations with all variables of varlist nonmissing. (I.e., it is potentially different from the pairwise computation of covariances.)

If compute_invcovarmat is specified, then the set of observations that are used for this computation will be determined by pop1var, pop2var, or covarpopvar, as will be explained below.

invcovarmat() and compute_invcovarmat are alternatives; one of them must be specified. If both are specified, then compute_invcovarmat takes precedence.

covarpop(covarpopvar) takes effect only if compute_invcovarmat is specified. This specifies that the inverse covariance matrix is to be computed on the set of observations for which covarpopvar is nonzero. If compute_invcovarmat is specified and covarpop(covarpopvar) is absent, then the computation of the inverse covariance matrix is based on the sets specified by pop1var and pop2var, as will be explained below. If you specify compute_invcovarmat, and neither pop1(pop1var) nor pop2(pop2var) is specified, then covarpop(covarpopvar) is required.

----------------------------------------------------------------------

Computation of the Inverse Covariance Matrix

Reiterating and expanding on the foregoing, when compute_invcovarmat is specified, the set of observations used in that computation is determined by...
 - covarpopvar, if specified; otherwise...
 - pop1var, if specified and pop2var is absent
 - pop2var, if specified and pop1var is absent
 - the combination of pop1var and pop2var, if both are specified.

In the latter scenario (compute_invcovarmat is specified, along with both pop1(pop1var) and pop2(pop2var), and in the absence of covarpop(covarpopvar)), there is a choice of how to make use of the "combination of pop1var and pop2var". The default is a "split population" method, which takes the covariance matrices of each population separately, then forms the weighted average of these two matrices, weighting them by the number of observations in pop1var and pop2var, along with an optional weight. That result is then inverted.

By contrast, you can use the union option, which simply uses the union of the two sets specified by pop1var and pop2var. See more on this under the union option.

----------------------------------------------------------------------
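One reading of the "split population" description above, written out as a formula. The symbols are assumptions introduced here for illustration: S_1 and S_2 denote the covariance matrices of varlist computed separately on the pop1var and pop2var sets, and n_1 and n_2 the corresponding numbers of observations. How an optional weight enters the averaging is determined by the program and is not shown.

```latex
S_{\text{split}} \;=\; \frac{n_1 S_1 + n_2 S_2}{n_1 + n_2},
\qquad
X \;=\; S_{\text{split}}^{-1}
```

Under the union option, by contrast, a single covariance matrix is computed on the pooled observations and then inverted.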

euclidean takes effect only if compute_invcovarmat is specified. It specifies that the off-diagonal elements of the covariance matrix are to be replaced with zeroes, which yields the normalized Euclidean distance measure. (This option applies only with compute_invcovarmat because the zeroing of off-diagonal elements is done to the covariance matrix - i.e., prior to inversion. If you prefer this measure and are providing the matrix via the invcovarmat() option, you should zero-out the off-diagonal elements prior to inverting - or directly construct a matrix of reciprocal variances. Note that if the diagonal elements of a matrix are c1, c2, ..., cp, and all other elements are zero, then its inverse consists of 1/c1, 1/c2, ..., 1/cp on the diagonal and zero elsewhere.) See more about this under Remarks.
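A minimal sketch of building the reciprocal-variance matrix yourself and passing it via invcovarmat(). It assumes that covariancemat stores the covariance matrix in M, as in the Examples section; the matrix names D and DINV and the variable names are hypothetical.

```stata
* Assumes income, age, numkids, treated and not_treated are in memory.
local vars "income age numkids"
covariancemat `vars', covarmat(M)

* Keep only the variances: a diagonal matrix built from the diagonal of M.
matrix D = diag(vecdiag(M))

* The inverse of a diagonal matrix has the reciprocals on its diagonal.
matrix DINV = inv(D)

* invcovarmat() requires row and column names equal to varlist, in order.
matrix rownames DINV = `vars'
matrix colnames DINV = `vars'

mahascore2 `vars', pop1(treated) pop2(not_treated) invcovarmat(DINV)
```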

union specifies that, when compute_invcovarmat is specified, along with both pop1(pop1var) and pop2(pop2var), and in the absence of covarpop(covarpopvar), the inverse covariance matrix is to be computed on the union of the two sets specified by pop1var and pop2var. By contrast, the default action uses a "split population" method, as described under Computation of the Inverse Covariance Matrix.

It is probably desirable not to use the union option, as it may overestimate the covariances and thereby underestimate the inverse covariance matrix and the consequential distance measure. This can be understood if you imagine two distinct populations where the values of the covariates are somewhat tightly clustered around two distinct centers that are significantly separated. Within each population, the covariances are small, but because of the separation of the centers, the covariances on the union are larger.
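The inflation described above can be seen in a small simulation using only built-in Stata commands. This is an illustrative sketch, not part of mahascore2; the variable names grp and x, the seed, and the separation of 10 units are arbitrary choices made here. It is meant to be run as a do-file.

```stata
clear
set seed 12345
set obs 200

* Two populations of 100 observations each.
gen byte grp = _n > 100

* Same spread (sd 1) in both populations, centers 10 units apart.
gen x = rnormal(0, 1) + 10*grp

* Within-population variances: each should be near 1.
summarize x if grp == 0
display "variance, population 0: " r(Var)
summarize x if grp == 1
display "variance, population 1: " r(Var)

* Variance on the union: inflated by the separation of the centers.
summarize x
display "variance, union:        " r(Var)
```

With equal-sized groups, the separation alone contributes about (10/2)^2 = 25 to the variance on the union, so the last figure should come out far larger than the two within-population figures.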

display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar and compute_invcovarmat was specified, then the covariance matrix (matrices) is (are) displayed; if it contains invcov, then the inverse covariance matrix is displayed; if it contains points then the point(s) (point1mat or point2mat) or the tuple(s) of means for pop1 or pop2 are displayed; if it contains diff, then the difference vector is displayed. Any other content is ignored.

If the inverse covariance matrix is displayed, it may be either invcovarmat or that which is computed as directed by the compute_invcovarmat option. This may be useful in debugging or just to assure you that the same set of (inverse) covariances are being used in repeated calls.
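A sketch of a call that requests every display item listed above. It assumes the keywords are given as a space-separated list inside display(), which the description suggests but does not state outright, and that treated and not_treated exist as in the Examples section.

```stata
* Show the covariance matrices, the inverse, both points and the difference vector.
mahascore2 income age numkids, pop1(treated) pop2(not_treated) ///
    compute_invcovarmat display(covar invcov points diff)
```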

Remarks

The (squared) distance measure generated is the matrix product d'Xd, where d is a vector of differences in the set of variables, and X is either the inverse of the covariance matrix of varlist (computed on a limited set of observations, as described above), or is a specified matrix that is provided via the invcovarmat() option.

The difference vector d is taken between the two points. That is, d = (pt2_1 - pt1_1 \ pt2_2 - pt1_2 \ ... \ pt2_p - pt1_p) where pt1_j is the jth element of pt1, corresponding to the jth variable of varlist, and pt1 is the first point, that is, either point1mat or the vector of means of varlist computed on the set indicated by pop1var, depending on how the first point was specified. Similarly for pt2_j and the second point.

Thus, the generated value is the sum of all the possible products of pairs of elements of d, weighted by corresponding elements of X. This includes components that are the squares of elements of d, weighted by the elements on the diagonal of X, plus other products (of differing elements of d), weighted by the off-diagonal elements of X.
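Written out elementwise, the quadratic form described above is as follows. The index notation d_i and X_{ij} is introduced here for illustration, with d_j = pt2_j - pt1_j as defined in the text.

```latex
d' X d \;=\; \sum_{i=1}^{p} \sum_{j=1}^{p} d_i \, X_{ij} \, d_j
\;=\; \sum_{i=1}^{p} X_{ii} \, d_i^{2} \;+\; \sum_{i \neq j} X_{ij} \, d_i \, d_j
```

The first sum on the right collects the squared differences weighted by the diagonal of X; the second collects the cross products weighted by the off-diagonal elements. With the euclidean option, only the first sum remains.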

Note that the generated value is a single number, though formally it is a 1-by-1 matrix. It is expected to be >=0 if X is truly an inverse covariance matrix, as such matrices are known to be positive semi-definite. However, if X is an arbitrary matrix, then there is no guarantee that the result will be nonnegative.
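The product d'Xd can also be formed by hand with Stata's matrix commands, which gives a way to cross-check a result. This is a sketch under assumptions: it presumes income, age and numkids are in memory, that covariancemat stores its result in M as in the Examples section, and the matrix names d and Q are hypothetical.

```stata
local vars "income age numkids"
covariancemat `vars', covarmat(M)
matrix MINV = inv(M)

* Two explicit points, as p-by-1 column vectors.
matrix V1 = (20000 \ 25 \ 2)
matrix V2 = (26000 \ 29 \ 1)
matrix rownames V1 = `vars'
matrix rownames V2 = `vars'

* Difference vector and the 1-by-1 quadratic form d'Xd.
matrix d = V2 - V1
matrix Q = d' * MINV * d
display "by hand:    " Q[1,1]

* The same inputs passed to mahascore2 for comparison.
mahascore2 `vars', point1(V1) point2(V2) invcovarmat(MINV)
display "mahascore2: " r(mahascore_sq)
```

If the program follows the formula given in these Remarks, the two displayed values should agree up to rounding.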

There are two purposes for the invcovarmat() option. First, it can save unnecessary repeated calculations whenever mahascore2 is repeatedly called on the same dataset with the same intended covariance population. Secondly, you may want to compute the inverse covariance matrix in some way not provided for. If these conditions do not apply, then the compute_invcovarmat option is appropriate.

The euclidean option, combined with compute_invcovarmat, yields the normalized Euclidean distance. It can be considered as a simplified version of the true Mahalanobis measure, and is less thorough in that it ignores correlations between different variables of varlist. It suffers from the flaw that highly correlated variables can act together as one variable but with disproportional weight. Another way to characterize it is that it presumes that the data are configured in ellipsoids that are oriented parallel to the axes. (In other contexts, it may fail to detect multivariate outliers. See mahascores for more on this, as well as other comments about the Euclidean measure - normalized or not.)

The normalized Euclidean measure is probably less desirable than the true Mahalanobis measure; it is provided as a comparison measure, and it replicates the behavior of earlier versions of the mahascore and mahapick programs.

If any of these conditions occur, then the resulting measure will be missing.

 - Any element of one of the points is missing (if an element of point1mat or point2mat is missing, or if a covariate is all-missing for one of the sets indicated by pop1var or pop2var).
 - Any of the inverse covariance elements are missing.

If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures.

Examples

. gen byte not_treated = ~treated
. mahascore2 income age numkids, pop1(treated) pop2(not_treated) compute

. mahascore2 income age numkids, pop1(treated) pop2(not_treated) covarpop(treated) compute

. sysuse auto
. gen byte dom = ~foreign
. mahascore2 price mpg rep78 headroom trunk weight length turn displac, pop1(foreign) pop2(dom) compute

Note that the above use of treated and not_treated (or foreign and dom) partitions the observations into two complementary sets. This may be a commonly-desired setup, but is not required.

To create your own inverse covariance matrix:

. local vars "income age numkids"
. covariancemat `vars', covarmat(M)
  - to use all observations, or...
. covariancemat `vars' in 1/60, covarmat(M)
  - to use the first 60 observations.

. mat MINV = inv(M) // or possibly invsym(M)
. mahascore2 `vars', pop1(treated) pop2(not_treated) invcovarmat(MINV)

To create your own reference values:

. local vars "income age numkids"
. matrix V1 = (20000 \ 25 \ 2)
. matrix V2 = (26000 \ 29 \ 1)
. matrix rownames V1 = `vars'
. matrix rownames V2 = `vars'
. gen byte one = 1
. mahascore2 `vars', point1(V1) point2(V2) covarpop(one) compute
. mahascore2 `vars', point1(V1) point2(V2) covarpop(treated) compute
. mahascore2 `vars', point1(V1) pop2(treated) compute

Acknowledgement

The author wishes to thank Evan Kontopantelis of the University of Manchester for suggesting this program. Additional thanks go to Joseph Harkness, formerly of The Institute for Policy Studies at Johns Hopkins University, for guidance in developing the suite of Mahalanobis distance programs, as well as Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung GmbH, for suggesting further improvements.

Author

David Kantor; Email kantor.d@att.net if you observe any problems.

Also See

mahapick, mahascore, mahascores, covariancemat, variancemat, screenmatches, stackids, hotelling.

--------------------------------------------------------------------

Note: The hotelling program is similar in that it generates a Mahalanobis distance measure; it then uses that result to perform a significance test. The author (of mahascore2) believes - though is not certain - that hotelling does the equivalent of the union option for computing the covariance matrix.

--------------------------------------------------------------------