------------------------------------------------------------------------------- help for mahascore2 -------------------------------------------------------------------------------
Generate a Mahalanobis distance measure between two "points", each either explicitly specified or taken as the means of a specified population
mahascore2 varlist [weight] , [ point1(point1mat) point2(point2mat) pop1(pop1var) pop2(pop2var) covarpop(covarpopvar) invcovarmat(invcovarmat) compute_invcovarmat euclidean union display(display_options) ]
Description
mahascore2 generates a squared Mahalanobis distance measure between two "points" in "data space" - with data space defined by varlist.
In contrast to other programs of the mahapick suite (mahascore, mahascores, mahapick), which generate multitudes of values, this generates a single value: the squared distance between the two points. The value is returned in r(mahascore_sq).
varlist (the "covariates") is a list of numeric variables on which to build the distance measure. These variables should be of numeric significance, not categorical; any categorical variables should be replaced by a set of indicator variables.
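As a rough sketch of the indicator-variable replacement described above, Stata's tabulate command with the generate() option creates one indicator per category. The variable region and the stub reg_ are hypothetical names used only for illustration, and the example assumes region has four categories. One indicator is left out of varlist because the full set sums to 1 in every observation, which would make the covariance matrix singular.

```stata
* Hypothetical: region is a categorical variable with four levels
tabulate region, generate(reg_)
* The line above creates the indicators reg_1, reg_2, reg_3, reg_4

* Omit one indicator (here reg_1) so the covariance matrix stays invertible
mahascore2 income age reg_2 reg_3 reg_4, pop1(treated) pop2(not_treated) compute_invcovarmat
```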
Weights are allowed; they affect only the means computed under the pop1 and pop2 options, and the computation of the inverse covariance matrix, under the compute_invcovarmat option.
Each point can be specified as either...
  - a tuple of values stored in a matrix, using the point1 and/or point2 options;
  - the means of the variables of varlist, restricted to a specified subset of the observations, using the pop1 and/or pop2 options.
Note that either point may be specified by either method.
The result is actually the square of the Mahalanobis distance measure; both the squared and unsquared values are reported, but only the squared value is returned. Note that in most usages, the resulting values are used in comparisons or sortings; the proportional magnitude is not significant, so the squared value is just as good.
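A minimal sketch of picking up the returned value follows. It assumes the hypothetical variables income, age, numkids, treated, and not_treated that appear under Examples; the scalar name dsq is arbitrary.

```stata
mahascore2 income age numkids, pop1(treated) pop2(not_treated) compute_invcovarmat

* Copy the returned value before another r-class command overwrites r()
scalar dsq = r(mahascore_sq)
display "squared distance:   " dsq
display "unsquared distance: " sqrt(dsq)
```

If the unsquared value is needed later, it has to be derived this way, since only the squared value is returned.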
Options
In what follows, let p denote the number of variables in varlist.
point1(point1mat) specifies the first point in explicit terms. point1mat is the name of a matrix bearing a tuple of values; it must be a column vector (a p-by-1 matrix) whose entries correspond to the variables in varlist, and whose rownames equal the names in varlist in the same order. An example of how to do this is given below.
pop1(pop1var) specifies the first point implicitly as the tuple of means of varlist, limited to the set of observations for which pop1var is nonzero. This is also known as the centroid of varlist, limited to the population indicated by pop1var. Note that the means are computed subject to weighting.
It is required to specify point1(point1mat) or pop1(pop1var). If both are present, then pop1(pop1var) takes precedence.
point2(point2mat) specifies the second point in explicit terms, in the same manner as point1.
pop2(pop2var) specifies the second point implicitly as the tuple of means of varlist, limited to the set of observations for which pop2var is nonzero. Note that the means are computed subject to weighting.
It is required to specify point2(point2mat) or pop2(pop2var). If both are present, then pop2(pop2var) takes precedence.
invcovarmat(invcovarmat) specifies the name of a matrix to be used in the computation described under Remarks. It is presumably the inverse covariance matrix of varlist (possibly for some subset of the observations), but the only requirement is that it be a square p-by-p matrix, and both the row and column names must equal the names in varlist in the same order as in varlist.
You can use covariancemat to help construct the inverse covariance matrix; it should be followed by a mat ... = inv() operation. An example is given below, in the Examples section. See further discussion of the purpose of this option, under Remarks.
compute_invcovarmat specifies that you want the inverse covariance matrix to be computed, rather than passed in (via invcovarmat()). This computation is subject to weighting. Note that this will call covariancemat, which computes covariances limited to observations with all variables of varlist nonmissing. (I.e., it is potentially different from the pairwise computation of covariances.)
If compute_invcovarmat is specified, then the set of observations that are used for this computation will be determined by pop1var, pop2var, or covarpopvar, as will be explained below.
invcovarmat() and compute_invcovarmat are alternatives; one of them must be specified. If both are specified, then compute_invcovarmat takes precedence.
covarpop(covarpopvar) takes effect only if compute_invcovarmat is specified. This specifies that the inverse covariance matrix is to be computed on the set of observations for which covarpopvar is nonzero. If compute_invcovarmat is specified and covarpop(covarpopvar) is absent, then the computation of the inverse covariance matrix is based on the sets specified by pop1var and pop2var, as will be explained below. If you specify compute_invcovarmat, and neither pop1(pop1var) nor pop2(pop2var) is specified, then covarpop(covarpopvar) is required.
----------------------------------------------------------------------
Computation of the Inverse Covariance Matrix
Reiterating and expanding on the foregoing, when compute_invcovarmat is specified, the set of observations used in that computation is determined by...
  - covarpopvar, if specified; otherwise...
  - pop1var, if specified and pop2var is absent
  - pop2var, if specified and pop1var is absent
  - the combination of pop1var and pop2var, if both are specified.
In the last scenario (compute_invcovarmat is specified, along with both pop1(pop1var) and pop2(pop2var), and in the absence of covarpop(covarpopvar)), there is a choice of how to make use of the "combination of pop1var and pop2var". The default is a "split population" method, which takes the covariance matrices of each population separately, then forms the weighted average of these two matrices, weighting them by the number of observations in pop1var and pop2var, along with an optional weight. That result is then inverted.
By contrast, you can use the union option, which simply uses the union of the two sets specified by pop1var and pop2var. See more on this under the union option.
----------------------------------------------------------------------
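The difference between the two methods can be sketched by hand with built-in commands. This is only an illustration, not the program's internal code: it assumes the weighted average is a plain average weighted by group size, ignores the optional weight, and uses correlate with the covariance option in place of covariancemat. The matrix names (C1, C2, CSPLIT, CUNION, ISPLIT, IUNION) are arbitrary.

```stata
sysuse auto, clear
gen byte dom = !foreign
* price, mpg, and weight have no missing values in the auto dataset
local vars "price mpg weight"

* Covariance matrix within each population
quietly correlate `vars' if foreign, covariance
matrix C1 = r(C)
local n1 = r(N)
quietly correlate `vars' if dom, covariance
matrix C2 = r(C)
local n2 = r(N)

* "Split population": average of the two matrices, weighted by group size
local n = `n1' + `n2'
matrix CSPLIT = (`n1'*C1 + `n2'*C2) / `n'

* "Union": one covariance matrix on the pooled observations
quietly correlate `vars' if foreign | dom, covariance
matrix CUNION = r(C)

* Invert each and list the results for comparison
matrix ISPLIT = invsym(CSPLIT)
matrix IUNION = invsym(CUNION)
matrix list ISPLIT
matrix list IUNION
```

Either inverse could then be passed to mahascore2 through invcovarmat(), since r(C) carries the variable names as its row and column names.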
euclidean takes effect only if compute_invcovarmat is specified. It specifies that the off-diagonal elements of the covariance matrix are to be replaced with zeroes, which yields the normalized Euclidean distance measure. (This option applies only with compute_invcovarmat because the zeroing of off-diagonal elements is done to the covariance matrix - i.e., prior to inversion. If you prefer this measure and are providing the matrix via the invcovarmat() option, you should zero-out the off-diagonal elements prior to inverting - or directly construct a matrix of reciprocal variances. Note that if the diagonal elements of a matrix are c1, c2, ..., cp, and all other elements are zero, then its inverse consists of 1/c1, 1/c2, ..., 1/cp on the diagonal and zero elsewhere.) See more about this under Remarks.
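For the invcovarmat() route, a matrix of reciprocal variances could be built along these lines. This is a sketch under assumptions: it uses the auto dataset, computes the covariance matrix on all observations with correlate rather than covariancemat, and the matrix names C, D, and DINV are arbitrary.

```stata
sysuse auto, clear
gen byte dom = !foreign
local vars "price mpg weight"

quietly correlate `vars', covariance
matrix C = r(C)

* Keep only the variances (zero the off-diagonal elements), then invert;
* the result has 1/c1, ..., 1/cp on the diagonal and zero elsewhere
matrix D = diag(vecdiag(C))
matrix DINV = invsym(D)

* Make sure the names match varlist, in order, as invcovarmat() requires
matrix rownames DINV = `vars'
matrix colnames DINV = `vars'

mahascore2 `vars', pop1(foreign) pop2(dom) invcovarmat(DINV) display(invcov)
```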
union specifies that, when compute_invcovarmat is specified along with both pop1(pop1var) and pop2(pop2var), and in the absence of covarpop(covarpopvar), the inverse covariance matrix is to be computed on the union of the two sets specified by pop1var and pop2var. By contrast, the default action uses a "split population" method, as described under Computation of the Inverse Covariance Matrix.
It is probably desirable not to use the union option, as it may overestimate the covariances and thereby underestimate the inverse covariance matrix and the resulting distance measure. This can be understood if you imagine two distinct populations where the values of the covariates are somewhat tightly clustered around two distinct centers that are significantly separated. Within each population, the covariances are small, but because of the separation of the centers, the covariances on the union are larger.
display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar and compute_invcovarmat was specified, then the covariance matrix (matrices) is (are) displayed; if it contains invcov, then the inverse covariance matrix is displayed; if it contains points then the point(s) (point1mat or point2mat) or the tuple(s) of means for pop1 or pop2 are displayed; if it contains diff, then the difference vector is displayed. Any other content is ignored.
If the inverse covariance matrix is displayed, it may be either invcovarmat or that which is computed as directed by the compute_invcovarmat option. This may be useful in debugging or just to assure you that the same set of (inverse) covariances is being used in repeated calls.
Remarks
The (squared) distance measure generated is the matrix product d'Xd, where d is a vector of differences in the set of variables, and X is either the inverse of the covariance matrix of varlist (computed on a limited set of observations, as described above), or is a specified matrix that is provided via the invcovarmat() option.
The difference vector d is taken between the two points. That is, d= (pt2_1 - pt1_1 \ pt2_2 - pt1_2 \ ... \ pt2_p - pt1_p) where pt1_j is the jth element of pt1, corresponding to the jth variable of varlist, and pt1 is the first point, that is, either point1mat or the vector of means of varlist computed on the set indicated by pop1var, depending on how the first point was specified. Similarly for pt2_j and the second point.
Thus, the generated value is the sum of all the possible products of pairs of elements of d, weighted by corresponding elements of X. This includes components that are the squares of elements of d, weighted by the elements on the diagonal of X, plus other products (of differing elements of d), weighted by the off-diagonal elements of X.
Note that the generated value is a single number, though formally it is a 1-by-1 matrix. It is expected to be >=0 if X is truly an inverse covariance matrix, as such matrices are known to be positive semi-definite. However, if X is an arbitrary matrix, then there is no guarantee that the result will be nonnegative.
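The computation d'Xd can be traced by hand with Stata's matrix commands. In this sketch the two points reuse the values from the Examples section, while X is a made-up diagonal matrix chosen only to keep the arithmetic easy to follow; it is not a real inverse covariance matrix. The names d, Q, and dsq are arbitrary.

```stata
* The two points as column vectors
matrix V1 = (20000 \ 25 \ 2)
matrix V2 = (26000 \ 29 \ 1)

* Hypothetical X (diagonal, for illustration only)
matrix X = (0.000001, 0, 0 \ 0, 0.04, 0 \ 0, 0, 1)

* Difference vector: d = (6000 \ 4 \ -1)
matrix d = V2 - V1

* d'Xd is a 1-by-1 matrix; extract its single element
* Here: 6000^2*0.000001 + 4^2*0.04 + (-1)^2*1 = 36 + 0.64 + 1 = 37.64
matrix Q = d' * X * d
scalar dsq = Q[1,1]

display dsq
display sqrt(dsq)
```

With a diagonal X only the squared terms contribute; nonzero off-diagonal elements would add the cross-product terms described above.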
There are two purposes for the invcovarmat() option. First, it can save unnecessary repeated calculations whenever mahascore2 is repeatedly called on the same dataset with the same intended covariance population. Secondly, you may want to compute the inverse covariance matrix in some way not provided for. If these conditions do not apply, then the compute_invcovarmat option is appropriate.
The euclidean option, combined with compute_invcovarmat, yields the normalized Euclidean distance. It can be considered as a simplified version of the true Mahalanobis measure, and is less thorough in that it ignores correlations between different variables of varlist. It suffers from the flaw that highly correlated variables can act together as one variable but with disproportional weight. Another way to characterize it is that it presumes that the data are configured in ellipsoids that are oriented parallel to the axes. (In other contexts, it may fail to detect multivariate outliers. See mahascores for more on this, as well as other comments about the Euclidean measure - normalized or not.)
The normalized Euclidean measure is probably less desirable than the true Mahalanobis measure; it is provided as a comparison measure, and it replicates the behavior of earlier versions of the mahascore and mahapick programs.
If any of these conditions occurs, then the resulting measure will be missing:
  - Any element of one of the points is missing (if an element of point1mat or point2mat is missing, or if a covariate is all-missing for one of the sets indicated by pop1var or pop2var).
  - Any of the inverse covariance elements is missing.
If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures.
Examples
. gen byte not_treated = ~treated
. mahascore2 income age numkids, pop1(treated) pop2(not_treated) compute
. mahascore2 income age numkids, pop1(treated) pop2(not_treated) covarpop(treated) compute
. sysuse auto
. gen byte dom = ~foreign
. mahascore2 price mpg rep78 headroom trunk weight length turn displac, pop1(foreign) pop2(dom) compute
Note that the above use of treated and not_treated (or foreign and dom) partitions the observations into two complementary sets. This may be a commonly desired setup, but is not required.
To create your own inverse covariance matrix:
. local vars "income age numkids"
. covariancemat `vars', covarmat(M)
- to use all observations, or...
. covariancemat `vars' in 1/60, covarmat(M)
- to use the first 60 observations.
. mat MINV = inv(M) // or possibly invsym(M)
. mahascore2 `vars', pop1(treated) pop2(not_treated) invcovarmat(MINV)
To create your own reference values:
. local vars "income age numkids"
. matrix V1 = (20000 \ 25 \ 2)
. matrix V2 = (26000 \ 29 \ 1)
. matrix rownames V1 = `vars'
. matrix rownames V2 = `vars'
. gen byte one = 1
. mahascore2 `vars', point1(V1) point2(V2) covarpop(one) compute
. mahascore2 `vars', point1(V1) point2(V2) covarpop(treated) compute
. mahascore2 `vars', point1(V1) pop2(treated) compute
Acknowledgement
The author wishes to thank Evan Kontopantelis of the University of Manchester for suggesting this program.
Additional thanks go to Joseph Harkness, formerly of The Institute for Policy Studies at Johns Hopkins University, for guidance in developing the suite of Mahalanobis distance programs, as well as Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung GmbH, for suggesting further improvements.
Author
David Kantor; Email kantor.d@att.net if you observe any problems.
Also See
mahapick, mahascore, mahascores, covariancemat, variancemat, screenmatches, stackids, hotelling.
--------------------------------------------------------------------
Note: The hotelling program is similar in that it generates a Mahalanobis distance measure; it then uses that result to perform a significance test. The author (of mahascore2) believes - though is not certain - that hotelling does the equivalent of the union option for computing the covariance matrix.
--------------------------------------------------------------------