```
-------------------------------------------------------------------------------
help for mahascore2
-------------------------------------------------------------------------------

Generate a Mahalanobis distance measure between two "points",
- either explicitly specified or as the means of specified populations

mahascore2 varlist [weight] , [ point1(point1mat) point2(point2mat)
pop1(pop1var) pop2(pop2var) covarpop(covarpopvar)
invcovarmat(invcovarmat) compute_invcovarmat euclidean union
display(display_options) ]

Description

mahascore2 generates a squared Mahalanobis distance measure between two
"points" in "data space" - with data space defined by varlist.

In contrast to other programs of the mahapick suite (mahascore,
mahascores, mahapick), which generate multitudes of values, this
generates a single value:  the squared distance between the two points.
The value is returned in r(mahascore_sq).

varlist (the "covariates") is a list of numeric variables on which to
build the distance measure.  These variables should be of numeric
significance, not categorical; any categorical variables should be
replaced by a set of indicator variables.

Weights are allowed; they affect only the means computed under the pop1
and pop2 options, and the computation of the inverse covariance matrix,
under the compute_invcovarmat option.

Each point can be specified as either...
- a tuple of values stored in a matrix, using the point1 and/or
point2 options;
- the means of the variables of varlist, restricted to a specified
subset of the observations, using the pop1 and/or pop2 options.
Note that either point may be specified by either method.

The result is actually the square of the Mahalanobis distance measure;
both the squared and unsquared values are reported, but only the squared
value is returned.  Note that in most usages, the resulting values are
used in comparisons or sortings; the proportional magnitude is not
significant, so the squared value is just as good.

Options

In what follows, let p denote the number of variables in varlist.

point1(point1mat) specifies the first point in explicit terms.  point1mat
is the name of a matrix bearing a tuple of values; it must be a column
vector (a p-by-1 matrix) whose entries correspond to the variables in
varlist, and whose rownames equal the names in varlist in the same order.
An example of how to do this is given below.

pop1(pop1var) specifies the first point implicitly as the tuple of means
of varlist, limited to the set of observations for which pop1var is
nonzero. This is also known as the centroid of varlist, limited to the
population indicated by pop1var.  Note that the means are computed
subject to weighting.

It is required to specify point1(point1mat) or pop1(pop1var).  If both
are present, then pop1(pop1var) takes precedence.

point2(point2mat) specifies the second point in explicit terms, in the
same manner as point1.

pop2(pop2var) specifies the second point implicitly as the tuple of means
of varlist, limited to the set of observations for which pop2var is
nonzero. Note that the means are computed subject to weighting.

It is required to specify point2(point2mat) or pop2(pop2var).  If both
are present, then pop2(pop2var) takes precedence.

invcovarmat(invcovarmat) specifies the name of a matrix to be used in the
computation described under Remarks. It is presumably the inverse
covariance matrix of varlist (possibly for some subset of the
observations), but the only requirement is that it be a square p-by-p
matrix, and both the row and column names must equal the names in varlist
in the same order as in varlist.

You can use covariancemat to help construct the inverse covariance
matrix; it should be followed by a mat ...  = inv() operation.  An
example is given below, in the Examples section.  See further discussion
of the purpose of this option, under Remarks.

compute_invcovarmat specifies that you want the inverse covariance matrix
to be computed, rather than passed in (via invcovarmat()).  This
computation is subject to weighting.  Note that this will call
covariancemat, which computes covariances limited to observations with
all variables of varlist nonmissing.  (I.e., it is potentially different
from the pairwise computation of covariances.)

If compute_invcovarmat is specified, then the set of observations that
are used for this computation will be determined by pop1var, pop2var, or
covarpopvar, as will be explained below.

invcovarmat() and compute_invcovarmat are alternatives; one of them must
be specified. If both are specified, then compute_invcovarmat takes
precedence.

covarpop(covarpopvar) takes effect only if compute_invcovarmat is
specified.  This specifies that the inverse covariance matrix is to be
computed on the set of observations for which covarpopvar is nonzero.  If
compute_invcovarmat is specified and covarpop(covarpopvar) is absent,
then the computation of the inverse covariance matrix is based on the
sets specified by pop1var and pop2var, as will be explained below.  If
you specify compute_invcovarmat, and neither pop1(pop1var) nor
pop2(pop2var) is specified, then covarpop(covarpopvar) is required.

----------------------------------------------------------------------
Computation of the Inverse Covariance Matrix

Reiterating and expanding on the foregoing, when
compute_invcovarmat is specified, the set of observations used in
that computation is determined by...
- covarpopvar, if specified; otherwise...
- pop1var, if specified and pop2var is absent
- pop2var, if specified and pop1var is absent
- the combination of pop1var and pop2var, if both are specified.

In the last scenario (compute_invcovarmat is specified, along
with both pop1(pop1var) and pop2(pop2var), and in the absence of
covarpop(covarpopvar)), there is a choice of how to make use of the
"combination of pop1var and pop2var".  The default is a "split
population" method, which takes the covariance matrices of each
population separately, then forms the weighted average of these two
matrices, weighting them by the number of observations in pop1var
and pop2var, along with an optional weight. That result is then
inverted.

By contrast, you can use the union option, which simply uses the
union of the two sets specified by pop1var and pop2var. See more on
this under the union option.
----------------------------------------------------------------------

euclidean takes effect only if compute_invcovarmat is specified.  It
specifies that the off-diagonal elements of the covariance matrix are to
be replaced with zeroes, which yields the normalized Euclidean distance
measure. (This option applies only with compute_invcovarmat because the
zeroing of off-diagonal elements is done to the covariance matrix - i.e.,
prior to inversion.  If you prefer this measure and are providing the
matrix via the invcovarmat() option, you should zero-out the off-diagonal
elements prior to inverting - or directly construct a matrix of
reciprocal variances.  Note that if the diagonal elements of a matrix are
c1, c2, ..., cp, and all other elements are zero, then its inverse
consists of 1/c1, 1/c2, ..., 1/cp on the diagonal and zero elsewhere.)

union specifies that, when compute_invcovarmat is specified along with
both pop1(pop1var) and pop2(pop2var), and in the absence of
covarpop(covarpopvar), the inverse covariance matrix is to be
computed on the union of the two sets specified by pop1var and pop2var.
By contrast, the default action uses a "split population" method, as
described under Computation of the Inverse Covariance Matrix.

It is probably desirable not to use the union option, as it may
overestimate the covariances and thereby underestimate the inverse
covariance matrix and the consequent distance measure.  This can be
understood if
you imagine two distinct populations where the values of the covariates
are somewhat tightly clustered around two distinct centers that are
significantly separated. Within each population, the covariances are
small, but because of the separation of the centers, the covariances on
the union are larger.

display(display_options) turns on the display of certain data structures
used in the computation. If display_options contains covar and
compute_invcovarmat was specified, then the covariance matrix (matrices)
is (are) displayed; if it contains invcov, then the inverse covariance
matrix is displayed; if it contains points then the point(s) (point1mat
or point2mat) or the tuple(s) of means for pop1 or pop2 are displayed; if
it contains diff, then the difference vector is displayed.  Any other
content is ignored.

If the inverse covariance matrix is displayed, it may be either
invcovarmat or that which is computed as directed by the
compute_invcovarmat option.  This may be useful in debugging or just to
assure you that the same set of (inverse) covariances is being used in
repeated calls.

Remarks

The (squared) distance measure generated is the matrix product d'Xd,
where d is a vector of differences in the set of variables, and X is
either the inverse of the covariance matrix of varlist (computed on a
limited set of observations, as described above), or is a specified
matrix that is provided via the invcovarmat() option.

The difference vector d is taken between the two points. That is,
d= (pt2_1 - pt1_1 \ pt2_2 - pt1_2 \ ... \ pt2_p - pt1_p)
where pt1_j is the jth element of pt1, corresponding to the jth variable
of varlist, and pt1 is the first point, that is, either point1mat or the
vector of means of varlist computed on the set indicated by pop1var,
depending on how the first point was specified. Similarly for pt2_j and
the second point.

Thus, the generated value is the sum of all the possible products of
pairs of elements of d, weighted by corresponding elements of X.  This
includes components that are the squares of elements of d, weighted by
the elements on the diagonal of X, plus other products (of differing
elements of d), weighted by the off-diagonal elements of X.

Note that the generated value is a single number, though formally it is a
1-by-1 matrix. It is expected to be >=0 if X is truly an inverse
covariance matrix, as such matrices are known to be positive
semi-definite.  However, if X is an arbitrary matrix, then there is no
guarantee that the result will be nonnegative.

There are two purposes for the invcovarmat() option.  First, it can save
unnecessary repeated calculations whenever mahascore2 is repeatedly
called on the same dataset with the same intended covariance population.
Secondly, you may want to compute the inverse covariance matrix in some
way not provided for. If these conditions do not apply, then the
compute_invcovarmat option is appropriate.

The euclidean option, combined with compute_invcovarmat, yields the
normalized Euclidean distance. It can be considered as a simplified
version of the true Mahalanobis measure, and is less thorough in that it
ignores correlations between different variables of varlist.  It suffers
from the flaw that highly correlated variables can act together as one
variable but with disproportional weight. Another way to characterize it
is that it presumes that the data are configured in ellipsoids that are
oriented parallel to the axes. (In other contexts, it may fail to detect
multivariate outliers. See mahascores for more on this, as well as other
comments about the Euclidean measure - normalized or not.)

The normalized Euclidean measure is probably less desirable than the true
Mahalanobis measure; it is provided as a comparison measure, and it
replicates the behavior of earlier versions of mahascore and mahapick
programs.

If any of these conditions occurs, then the resulting measure will be
missing:

- Any element of one of the points is missing (if an element of
point1mat or point2mat is missing, or if a covariate is all-missing
for one of the sets indicated by pop1var or pop2var).

- Any of the inverse covariance elements is missing.

If the inverse covariance matrix is computed on a very small set of
observations, it may not be valid and may yield strange results. It might
fail to be positive semi-definite, and can yield negative measures.

Examples

. gen byte not_treated = ~treated
. mahascore2 income age numkids, pop1(treated) pop2(not_treated) compute

. mahascore2 income age numkids, pop1(treated) pop2(not_treated)
covarpop(treated) compute

. sysuse auto
. gen byte dom = ~foreign
. mahascore2 price mpg rep78 headroom trunk weight length turn displac,
pop1(foreign) pop2(dom) compute

Note that the above use of treated and not_treated (or foreign and dom)
partitions the observations into two complementary sets.  This may be a
commonly-desired setup, but is not required.

To create your own inverse covariance matrix:

. local vars "income age numkids"
. covariancemat `vars', covarmat(M)
- to use all observations, or...
. covariancemat `vars' in 1/60, covarmat(M)
- to use the first 60 observations.

. mat MINV = inv(M) // or possibly invsym(M)
. mahascore2 `vars', pop1(treated) pop2(not_treated) invcovarmat(MINV)

To create your own reference values:

. local vars "income age numkids"
. matrix V1 = (20000 \ 25 \ 2)
. matrix V2 = (26000 \ 29 \ 1)
. matrix rownames V1 = `vars'
. matrix rownames V2 = `vars'
. gen byte one = 1
. mahascore2 `vars', point1(V1) point2(V2) covarpop(one) compute
. mahascore2 `vars', point1(V1) point2(V2) covarpop(treated) compute
. mahascore2 `vars', point1(V1) pop2(treated) compute

Acknowledgement
The author wishes to thank Evan Kontopantelis of the University of
Manchester for suggesting this program.

Additional thanks go to Joseph Harkness, formerly of The Institute for
Policy Studies at Johns Hopkins University for guidance in developing the
suite of Mahalanobis distance programs, as well as Heiko Giebler of
Wissenschaftszentrum Berlin fur Sozialforschung GmbH, for suggesting
further improvements.

Author
David Kantor; Email kantor.d@att.net if you observe any problems.

Also See
mahapick, mahascore, mahascores, covariancemat, variancemat,
screenmatches, stackids, hotelling.

--------------------------------------------------------------------
Note: The hotelling program is similar in that it
generates a Mahalanobis distance measure; it then uses
that result to perform a significance test. The author
(of mahascore2) believes - though is not certain - that
hotelling does the equivalent of the union option for
computing the covariance matrix.
--------------------------------------------------------------------
```
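
The Remarks describe the returned value as the matrix product d'Xd. As a rough cross-check, that product can be formed by hand with Stata's matrix commands and compared with r(mahascore_sq). The sketch below is illustrative only: the choice of the auto dataset, the three covariates, and the two points are assumptions made for the example, and it presumes that covariancemat labels the rows and columns of M with the variable names (as invcovarmat() requires) and that M is invertible.

```stata
* Illustrative sketch, not part of the original help file.
* Assumes the mahapick suite (mahascore2, covariancemat) is installed.
sysuse auto, clear
local vars "price mpg weight"

* X: inverse covariance matrix of the covariates, built as in the Examples section
covariancemat `vars', covarmat(M)
matrix MINV = inv(M)

* Two explicit points (values chosen arbitrarily for illustration)
matrix V1 = (4000 \ 25 \ 2500)
matrix V2 = (6000 \ 20 \ 3200)
matrix rownames V1 = `vars'
matrix rownames V2 = `vars'

* d = point 2 minus point 1; Q = d'Xd is a 1-by-1 matrix
matrix d = V2 - V1
matrix Q = d' * MINV * d

* Compare the hand computation with the value returned by mahascore2
mahascore2 `vars', point1(V1) point2(V2) invcovarmat(MINV)
display "by hand: " Q[1,1] "   mahascore2: " r(mahascore_sq)
```

If the two displayed numbers agree, the matrix passed via invcovarmat() is being used as X in the way the Remarks describe. Because the sign of d does not affect d'Xd, swapping V1 and V2 should leave both values unchanged.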