{smcl}
{* 2012feb17; 2012nov12 deleted one char}
{hline}
help for {hi:mahascore2}
{hline}
{title:Generate a Mahalanobis distance measure between two "points",}
{title: {c -} either explicitly specified or as the means of specified populations}
{p 8 17 2}
{cmd:mahascore2}
{it:varlist} [{it:weight}] {cmd:,}
[
{cmd:point1(}{it:point1mat}{cmd:)}
{cmd:point2(}{it:point2mat}{cmd:)}
{cmd:pop1(}{it:pop1var}{cmd:)}
{cmd:pop2(}{it:pop2var}{cmd:)}
{cmd:covarpop(}{it:covarpopvar}{cmd:)}
{cmdab:invcov:armat(}{it:invcovarmat}{cmd:)}
{cmdab:compute:_invcovarmat}
{cmdab:eucl:idean}
{cmd:union}
{cmdab:disp:lay(}{it:display_options}{cmd:)}
]
{title:Description}
{p 4 4 2}
{cmd:mahascore2} generates a squared Mahalanobis distance measure between two "points"
in "data space" {c -} with data space defined by {it:varlist}.
{p 4 4 2}
In contrast to other programs of the mahapick suite ({cmd:mahascore}, {cmd:mahascores},
{cmd:mahapick}), which generate multitudes of values, this generates a single value:
the squared distance between the two points. The value is returned in {cmd:r(mahascore_sq)}.
{p 4 4 2}
{it:varlist} (the "covariates") is a list of numeric variables on which to build
the distance measure.
These variables should be of numeric significance, not categorical; any
categorical variables should be replaced by a set of indicator variables.
{p 4 4 2}
Weights are allowed; they affect only the means computed under the {cmd:pop1}
and {cmd:pop2} options, and the computation of the inverse covariance matrix, under
the {cmd:compute_invcovarmat} option.
{p 4 4 2}
Each point can be specified as either...{p_end}
{p 8 8 2}{c -} a tuple of values stored in a matrix, using the {cmd:point1} and/or {cmd:point2} options;{p_end}
{p 8 8 2}{c -} the means of the variables of {it:varlist}, restricted to a specified subset of the
observations, using the {cmd:pop1} and/or {cmd:pop2} options.{p_end}
{p 4 4 2}
Note that either point may be specified by either method.
{p 4 4 2}
The result is actually the square of the Mahalanobis
distance measure; both the squared and unsquared values are reported, but only the squared
value is returned.
Note that in most usages, the resulting values are used in comparisons
or sortings; the proportional magnitude is not significant, so the squared
value is just as good.
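{p 4 4 2}
As a minimal sketch of retrieving the result (assuming the mahapick suite is installed; the auto dataset, the three covariates, and the indicator name {cmd:dom} are illustrative choices only), the squared value can be read from {cmd:r(mahascore_sq)}, and the unsquared distance recovered with {cmd:sqrt()}:{p_end}

```stata
sysuse auto, clear
* dom marks the complement of foreign (illustrative indicator)
generate byte dom = !foreign
mahascore2 price mpg weight, pop1(foreign) pop2(dom) compute
* Only the squared value is returned; take the root yourself if needed
display "squared distance = " r(mahascore_sq)
display "distance         = " sqrt(r(mahascore_sq))
```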
{title:Options}
{p 4 4 2}
In what follows, let {it:p} denote the number of variables in {it:varlist}.
{p 4 4 2}
{cmd:point1(}{it:point1mat}{cmd:)} specifies the first point in explicit terms.
{it:point1mat} is the name of a matrix bearing a tuple of values; it must be a
column vector (a {it:p}-by-1 matrix) whose entries correspond to
the variables in {it:varlist}, and whose rownames equal the names in
{it:varlist} in the same order.
An example of how to do this is given below.
{p 4 4 2}
{cmd:pop1(}{it:pop1var}{cmd:)} specifies the first point implicitly as the tuple of
means of {it:varlist}, limited to the set of observations for which {it:pop1var} is
nonzero. This is also known as the centroid of {it:varlist}, limited to the
population indicated by {it:pop1var}.
Note that the means are computed subject to weighting.
{p 4 4 2}
It is required to specify {cmd:point1(}{it:point1mat}{cmd:)} or {cmd:pop1(}{it:pop1var}{cmd:)}.
If both are present, then {cmd:pop1(}{it:pop1var}{cmd:)} takes precedence.
{p 4 4 2}
{cmd:point2(}{it:point2mat}{cmd:)} specifies the second point in explicit terms,
in the same manner as {cmd:point1}.
{p 4 4 2}
{cmd:pop2(}{it:pop2var}{cmd:)} specifies the second point implicitly as the tuple of
means of {it:varlist}, limited to the set of observations for which {it:pop2var} is
nonzero. Note that the means are computed subject to weighting.
{p 4 4 2}
It is required to specify {cmd:point2(}{it:point2mat}{cmd:)} or {cmd:pop2(}{it:pop2var}{cmd:)}.
If both are present, then {cmd:pop2(}{it:pop2var}{cmd:)} takes precedence.
{p 4 4 2}
{cmd:invcovarmat(}{it:invcovarmat}{cmd:)} specifies the name of a matrix
to be used in the computation described under {ul:Remarks}. It is presumably the
inverse covariance matrix of {it:varlist} (possibly for some subset of the observations),
but the only requirement is that
it be a square {it:p}-by-{it:p} matrix, and both the row and column names
must equal the names in {it:varlist} in the same order as in {it:varlist}.
{p 4 4 2}
You can use {help covariancemat} to help construct the inverse covariance matrix;
it should be followed by a {cmd: mat} ... {cmd: = inv()} operation.
An example is given below, in the {ul:Examples} section.
See further discussion of the purpose of this option, under {ul:Remarks}.
{p 4 4 2}
{cmd:compute_invcovarmat} specifies that you want the inverse covariance
matrix to be computed, rather than passed in (via {cmd:invcovarmat()}).
This computation is subject to weighting.
Note that this will call {help covariancemat}, which computes covariances
limited to observations with all variables of {it:varlist} nonmissing.
(I.e., it is potentially different from the pairwise computation of covariances.)
{p 4 4 2}
If {cmd:compute_invcovarmat} is specified, then the set of observations that are used
for this computation will be determined by {it:pop1var}, {it:pop2var}, or {it:covarpopvar},
as will be explained below.
{p 4 4 2}
{cmd:invcovarmat()} and {cmd:compute_invcovarmat} are alternatives; one of
them must be specified. If both are specified, then {cmd:compute_invcovarmat}
takes precedence.
{p 4 4 2}
{cmd:covarpop(}{it:covarpopvar}{cmd:)} takes effect only if {cmd:compute_invcovarmat} is specified.
This specifies that the inverse covariance matrix is to be computed on the set of observations
for which {it:covarpopvar} is nonzero.
If {cmd:compute_invcovarmat} is specified and {cmd:covarpop(}{it:covarpopvar}{cmd:)} is absent,
then the computation of the inverse covariance matrix is based on the
sets specified by {it:pop1var} and {it:pop2var}, as will be explained below.
If {cmd:compute_invcovarmat} is specified and neither {cmd:pop1(}{it:pop1var}{cmd:)} nor {cmd:pop2(}{it:pop2var}{cmd:)} is
specified, then {cmd:covarpop(}{it:covarpopvar}{cmd:)} is required.
{col 10}{hline}
{p 10 10 2}
{ul:Computation of the Inverse Covariance Matrix}
{p 10 10 2}
Reiterating and expanding on the foregoing, when {cmd:compute_invcovarmat} is specified, the set of observations used in that
computation is determined by...{p_end}
{p 10 10 2}{c -} {it:covarpopvar}, if specified; otherwise...{p_end}
{p 12 12 2}{c -} {it:pop1var}, if specified and {it:pop2var} is absent{p_end}
{p 12 12 2}{c -} {it:pop2var}, if specified and {it:pop1var} is absent{p_end}
{p 12 12 2}{c -} the combination of {it:pop1var} and {it:pop2var}, if both are specified.{p_end}
{p 10 10 2}
In the latter scenario ({cmd:compute_invcovarmat} is specified, along with both
{cmd:pop1(}{it:pop1var}{cmd:)} and {cmd:pop2(}{it:pop2var}{cmd:)}, and in the absence of {cmd:covarpop(}{it:covarpopvar}{cmd:)}),
there is a choice of how to make use of the "combination of {it:pop1var} and {it:pop2var}".
The default is a "split population" method, which takes the covariance matrices of each population
separately, then forms the weighted
average of these two matrices, weighting them by the number of observations in {it:pop1var} and {it:pop2var},
along with an optional {it:weight}. That result is then inverted.
{p 10 10 2}
By contrast, you can use the {cmd:union} option, which simply uses the union of
the two sets specified by {it:pop1var} and {it:pop2var}. See more on this under the {cmd:union} option.{p_end}
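{p 10 10 2}
The "split population" idea can be sketched by hand with official Stata commands. This is an illustration under stated assumptions, not the program's actual code: it is unweighted, it weights each covariance matrix by its observation count {it:n} (the exact weighting used internally is not documented here), and the matrix names {cmd:C1}, {cmd:C2}, {cmd:CP}, and {cmd:MINV} are hypothetical.{p_end}

```stata
sysuse auto, clear
generate byte dom = !foreign
local vars "price mpg weight"

* Covariance matrix and observation count for each population
quietly correlate `vars' if foreign, covariance
matrix C1 = r(C)
local n1 = r(N)

quietly correlate `vars' if dom, covariance
matrix C2 = r(C)
local n2 = r(N)

* Average the two matrices, weighting by observation counts (assumed form)
matrix CP = (`n1' * C1 + `n2' * C2) / (`n1' + `n2')

* Invert the averaged matrix
matrix MINV = invsym(CP)
matrix list MINV
```

{p 10 10 2}
A matrix built this way could be passed in through {cmd:invcovarmat(MINV)}; its row and column names already match the variable list because {cmd:correlate} names {cmd:r(C)} after the variables.{p_end}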
{col 10}{hline}
{p 4 4 2}
{cmd:euclidean} takes effect only if {cmd:compute_invcovarmat} is specified.
It specifies that the off-diagonal elements of the covariance
matrix are to be replaced with zeroes, which yields the normalized Euclidean
distance measure. (This option applies only with {cmd:compute_invcovarmat}
because the zeroing of off-diagonal elements is done to the covariance
matrix {c -} i.e., prior to inversion.
If you prefer this measure and are providing the matrix via the {cmd:invcovarmat()}
option, you should zero-out the off-diagonal elements prior to inverting
{c -} or directly construct a matrix of reciprocal variances.
Note that if the diagonal elements of a matrix are c1, c2, ..., c{it:p}, and
all other elements are zero, then its inverse consists of 1/c1, 1/c2, ..., 1/c{it:p}
on the diagonal and zero elsewhere.)
See more about this under {ul:Remarks}.
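{p 4 4 2}
The manual route described in the parenthetical above can be sketched as follows. This is a minimal illustration, assuming the mahapick suite is installed for the final call; the covariance matrix is taken from official {cmd:correlate} rather than {cmd:covariancemat}, and the matrix names {cmd:C}, {cmd:V}, and {cmd:MINV} are hypothetical.{p_end}

```stata
sysuse auto, clear
generate byte dom = !foreign
local vars "price mpg weight"

* Covariance matrix on all complete observations
quietly correlate `vars', covariance
matrix C = r(C)

* Keep only the variances (zero the off-diagonal elements), then invert;
* the result holds the reciprocal variances on the diagonal
matrix V = diag(vecdiag(C))
matrix MINV = invsym(V)

* Make sure the names match the variable list, in order
matrix rownames MINV = `vars'
matrix colnames MINV = `vars'

mahascore2 `vars', pop1(foreign) pop2(dom) invcovarmat(MINV)
```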
{p 4 4 2}
{cmd:union} applies when {cmd:compute_invcovarmat} is specified along with both
{cmd:pop1(}{it:pop1var}{cmd:)} and {cmd:pop2(}{it:pop2var}{cmd:)}, and {cmd:covarpop(}{it:covarpopvar}{cmd:)} is absent.
It specifies that the inverse covariance
matrix is to be computed on the union of the two sets
specified by {it:pop1var} and {it:pop2var}. By contrast, the default action uses a
"split population" method, as described under {ul:Computation of the Inverse Covariance Matrix}.
{p 4 4 2}
It is probably desirable {it:not} to use the {cmd:union} option, as it may overestimate the covariances
and thereby underestimate the inverse covariance matrix and the resulting distance measure.
This can be understood if you imagine two distinct populations where
the values of the covariates are somewhat tightly clustered around two distinct centers that are significantly
separated. Within each population, the covariances are small, but because of the separation of the centers, the
covariances on the union are larger.
{p 4 4 2}
{cmd:display(}{it:display_options}{cmd:)} turns on the display of certain
data structures used in the computation. If {it:display_options} contains
{cmd:covar} and {cmd:compute_invcovarmat} was specified, then the covariance matrix (matrices) is (are) displayed;
if it contains {cmd:invcov}, then the inverse covariance matrix is displayed;
if it contains {cmd:points} then the point(s) ({it:point1mat} or {it:point2mat})
or the tuple(s) of means for {cmd:pop1} or {cmd:pop2} are displayed;
if it contains {cmd:diff}, then the difference vector is displayed.
Any other content is ignored.
{p 4 4 2}
If the inverse covariance matrix is displayed, it may be either
{it:invcovarmat} or that which is computed as directed by the
{cmd:compute_invcovarmat} option.
This may be useful in debugging or just to assure
you that the same set of (inverse) covariances are being used in repeated calls.
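{p 4 4 2}
For instance, a call requesting all four displays might look like the following. This is a sketch assuming the mahapick suite is installed and that the keywords inside {cmd:display()} are separated by spaces; the dataset and covariates are illustrative.{p_end}

```stata
sysuse auto, clear
generate byte dom = !foreign
* Show the covariance matrices, the inverse, both points, and the difference vector
mahascore2 price mpg weight, pop1(foreign) pop2(dom) compute display(covar invcov points diff)
```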
{title:Remarks}
{p 4 4 2}
The (squared) distance measure generated is the matrix product d'Xd, where d is a vector
of differences in the set of variables, and X is either the inverse of the
covariance matrix of {it:varlist} (computed on a limited set of observations,
as described above), or is a specified matrix that is provided via
the {cmd:invcovarmat()} option.
{p 4 4 2}
The difference vector d is taken between the two points. That is,{p_end}
{p 6 6 2}d = (pt2_1 - pt1_1 \ pt2_2 - pt1_2 \ ... \ pt2_{it:p} - pt1_{it:p}){p_end}
{p 4 4 2}where pt1_{it:j} is the {it:j}th element of pt1, corresponding to the
{it:j}th variable of {it:varlist}, and pt1 is the first point, that is, either
{it:point1mat} or the vector of means of {it:varlist} computed on the set indicated by
{it:pop1var}, depending on how the first point was specified. Similarly for
pt2_{it:j} and the second point.
{p 4 4 2}
Thus, the generated value is the sum of all the possible products of
pairs of elements of d, weighted by corresponding elements of X.
This includes components that are the
squares of elements of d, weighted by the elements on the diagonal of X, plus
other products (of differing elements of d), weighted by the off-diagonal
elements of X.
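{p 4 4 2}
The product d'Xd can be reproduced by hand with Stata's matrix commands. The sketch below is illustrative only: the two points are made-up values, X is taken as the inverse of the covariance matrix over all complete observations, and the names {cmd:pt1}, {cmd:pt2}, {cmd:d}, {cmd:X}, {cmd:Q}, and {cmd:dist_sq} are hypothetical.{p_end}

```stata
sysuse auto, clear
local vars "price mpg weight"

* X: inverse of the covariance matrix of the covariates
quietly correlate `vars', covariance
matrix C = r(C)
matrix X = invsym(C)

* Two explicit points, as p-by-1 column vectors (made-up values)
matrix pt1 = (5000 \ 20 \ 3000)
matrix pt2 = (6000 \ 25 \ 2500)
matrix rownames pt1 = `vars'
matrix rownames pt2 = `vars'

* Difference vector and the quadratic form d'Xd (a 1-by-1 matrix)
matrix d = pt2 - pt1
matrix Q = d' * X * d
scalar dist_sq = Q[1,1]

display "squared distance = " scalar(dist_sq)
display "distance         = " sqrt(scalar(dist_sq))
```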
{p 4 4 2}
Note that the generated value is a single number, though
formally it is a 1-by-1 matrix. It is expected to be >=0 if X is truly an
inverse covariance matrix, as such matrices are known to be positive semi-definite.
However, if X is an arbitrary matrix, then there is no guarantee that the
result will be nonnegative.
{p 4 4 2}
There are two purposes for the {cmd:invcovarmat()} option.
First, it can save unnecessary repeated calculations whenever
{cmd:mahascore2} is repeatedly called on the same dataset with the same
intended covariance population. Secondly,
you may want to compute the inverse covariance matrix in some way
not provided for. If these conditions do not apply, then the
{cmd:compute_invcovarmat} option is appropriate.
{p 4 4 2}
The {cmd:euclidean} option, combined with {cmd:compute_invcovarmat}, yields
the normalized Euclidean distance. It can be considered as a simplified version
of the true Mahalanobis measure, and is less thorough in that it ignores
correlations between different variables of {it:varlist}.
It suffers from the flaw that highly correlated variables can act together
as one variable but with disproportional weight. Another way to characterize
it is that it presumes that the data are configured in ellipsoids that are
oriented parallel to the axes. (In other contexts, it may fail to detect multivariate
outliers. See {help mahascores} for more on this, as well as other comments about
the Euclidean measure {c -} normalized or not.)
{p 4 4 2}
The normalized Euclidean measure is probably less desirable than the true
Mahalanobis measure; it is provided as
a comparison measure, and it replicates the behavior of earlier versions of
{cmd:mahascore} and {cmd:mahapick} programs.
{p 4 4 2}
If either of these conditions occurs, then the resulting measure will be missing.
{p 8 8 2}
Any element of one of the points is missing (if an element of
{it:point1mat} or {it:point2mat} is missing, or if a
covariate is all-missing for one of the sets indicated by {it:pop1var}
or {it:pop2var}).
{p 8 8 2}
Any element of the inverse covariance matrix is missing.
{p 4 4 2}
If the inverse covariance matrix is computed on a very small set of
observations, it may not be valid and may yield strange results. It
might fail to be positive semi-definite, and can yield negative measures.
{title:Examples}
{p 4 8 2} {cmd:. gen byte not_treated = ~treated}{p_end}
{p 4 8 2} {cmd:. mahascore2 income age numkids, pop1(treated) pop2(not_treated) compute}{p_end}
{p 4 8 2} {cmd:. mahascore2 income age numkids, pop1(treated) pop2(not_treated) covarpop(treated) compute}{p_end}
{p 4 8 2} {cmd:. sysuse auto}{p_end}
{p 4 8 2} {cmd:. gen byte dom = ~foreign}{p_end}
{p 4 8 2} {cmd:. mahascore2 price mpg rep78 headroom trunk weight length turn displac, pop1(foreign) pop2(dom) compute}{p_end}
{p 4 4 2}
Note that the above use of {cmd:treated} and {cmd:not_treated} (or {cmd:foreign} and {cmd:dom}) partitions the observations into two complementary sets.
This may be a commonly-desired setup, but is not required.
{p 4 4 2}
To create your own inverse covariance matrix:
{p 4 8 2} {cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2} {cmd:. covariancemat `vars', covarmat(M)}{p_end}
{p 4 4 2} {c -} to use all observations, or...{p_end}
{p 4 8 2} {cmd:. covariancemat `vars' in 1/60, covarmat(M)}{p_end}
{p 4 4 2} {c -} to use the first 60 observations.{p_end}
{p 4 8 2} {cmd:. mat MINV = inv(M) // or possibly invsym(M)}{p_end}
{p 4 8 2} {cmd:. mahascore2 `vars', pop1(treated) pop2(not_treated) invcovarmat(MINV)}{p_end}
{p 4 4 2}
To create your own reference values:
{p 4 8 2} {cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2} {cmd:. matrix V1 = (20000 \ 25 \ 2)}{p_end}
{p 4 8 2} {cmd:. matrix V2 = (26000 \ 29 \ 1)}{p_end}
{p 4 8 2} {cmd:. matrix rownames V1 = `vars'}{p_end}
{p 4 8 2} {cmd:. matrix rownames V2 = `vars'}{p_end}
{p 4 8 2} {cmd:. gen byte one = 1}{p_end}
{p 4 8 2} {cmd:. mahascore2 `vars', point1(V1) point2(V2) covarpop(one) compute}{p_end}
{p 4 8 2} {cmd:. mahascore2 `vars', point1(V1) point2(V2) covarpop(treated) compute}{p_end}
{p 4 8 2} {cmd:. mahascore2 `vars', point1(V1) pop2(treated) compute}{p_end}
{title:Acknowledgement}
{p 4 4 2}
The author wishes to thank Evan Kontopantelis of the University of Manchester for suggesting
this program.
{p 4 4 2}
Additional thanks go to
Joseph Harkness, formerly of The Institute for Policy Studies
at Johns Hopkins University for guidance in developing the suite of
Mahalanobis distance programs,
as well as Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung
GmbH, for suggesting further improvements.
{title:Author}
{p 4 4 2}
David Kantor;
Email {browse "mailto:kantor.d@att.net":kantor.d@att.net} if you observe any
problems.
{title:Also See}
{p 4 4 2}
{help mahapick}, {help mahascore}, {help mahascores}, {help covariancemat}, {help variancemat},
{help screenmatches}, {help stackids}, {help hotelling}.
{col 12}{hline}
{p 12 12 12}
{hi:Note:} The {help hotelling} program is similar in that it generates a Mahalanobis distance measure; it
then uses that result to perform a significance test. The author (of mahascore2) believes {c -} though
is not certain {c -} that hotelling
does the equivalent of the {cmd:union} option for computing the covariance matrix.
{p_end}
{col 12}{hline}