{smcl}
{* 26feb2008; rev 1apr2008, 2011dec16, 2012feb9, 2012feb17; 2012nov12 deleted one char}
{hline}
help for {hi:mahascore}
{hline}
{title:Generate a Mahalanobis distance measure}
{p 8 17 2}
{cmd:mahascore}
{it:varlist} [{it:weight}] {cmd:,} {cmd:gen(}{it:newvar}{cmd:)}
[
{cmd:refobs(}{it:#}{cmd:)}
{cmd:refvals(}{it:refvalsmat}{cmd:)}
{cmd:refmeans}
{cmd:treated(}{it:treatedvar}{cmd:)}
{cmdab:invcov:armat(}{it:invcovarmat}{cmd:)}
{cmdab:compute:_invcovarmat}
{cmdab:unsq:uared}
{cmdab:eucl:idean}
{cmdab:disp:lay(}{it:display_options}{cmd:)}
{cmdab:verb:ose}
{cmd:float}
{cmdab:nocovtrlim:itation}
{cmdab:nomeantrlim:itation}
]
{title:Description}
{p 4 4 2}
{cmd:mahascore} generates a (squared, by default) Mahalanobis distance measure between every
observation and a single tuple of reference values, which can be one of the following:{p_end}
{p 8 8 2}{c -} the tuple of values in a specified reference observation, using the {cmd:refobs} option;{p_end}
{p 8 8 2}{c -} a tuple of values passed in, using the {cmd:refvals} option;{p_end}
{p 8 8 2}{c -} the means of the variables of {it:varlist}, using the {cmd:refmeans} option.{p_end}
{p 4 4 2}
{cmd:mahascore} is used by {help mahascores} and {help mahapick}, but may be used
independently as well.
{p 4 4 2}
{it:varlist} (the "covariates") is a list of numeric variables on which to build
the distance measure.
These variables should be of numeric significance, not categorical; any
categorical variables should be replaced by a set of indicator variables.
{p 4 4 2}
Weights are allowed, but apply only under the
{cmd:compute_invcovarmat} and {cmd:refmeans} options.
{p 4 4 2}
By default, the result is actually the square of the Mahalanobis
distance measure. You can use the {cmd:unsquared} option to give you the
proper unsquared value.
Note, however, that in most uses the resulting values are only compared
or sorted; because the square root preserves the ordering of nonnegative
values, the squared values serve just as well.
{col 12}{hline}
{p 12 12 12}
{hi:Technical note:} As of 26mar2008, {cmd:mahascore} is revised to produce
a true Mahalanobis measure; previously, it produced the normalized Euclidean
measure. See the {cmd:euclidean} option for further explanation.
{p_end}
{col 12}{hline}
{title:Options}
{p 4 4 2}
In what follows, let {it:p} denote the number of variables in {it:varlist}.
{p 4 4 2}
{cmd:gen(}{it:newvar}{cmd:)} is required; it specifies the new variable which will contain
the generated distance measure. Its default type is double.
{p 4 4 2}
{cmd:float} specifies that the type of {it:newvar} will be float, rather than
double.
{p 4 4 2}
{cmd:refobs(}{it:#}{cmd:)} specifies an integer in the range 1 to _N,
indicating the reference observation. For example, if {it:#} = 12, then the
generated measure will be calculated between each observation and observation
12.
{p 4 4 2}
{cmd:refvals(}{it:refvalsmat}{cmd:)} enables you to pass in a tuple of
values to use as the comparison values; i.e., the distances will be measured
between each observation and this tuple. {it:refvalsmat} must be
a column vector (a {it:p}-by-1 matrix) whose entries correspond to
the variables in {it:varlist}, and whose rownames equal the names in
{it:varlist} in the same order.
An example of how to do this is given below.
{p 4 4 2}
{cmd:refmeans} specifies that the tuple of reference values shall be the
means of the variables of {it:varlist}; this is often referred to as the centroid
of {it:varlist}.
Note that the means are computed subject to weighting, as well as limitation by
{it:treatedvar} if the {cmd:treated()} option is specified.
(But see {cmd:nomeantrlimitation}.)
Also, see the discussion of multivariate outliers in the {ul:Remarks}
section.
{p 4 4 2}
{cmd:refobs()}, {cmd:refvals()}, and {cmd:refmeans} are alternatives; one of
them must be specified.
{p 4 4 2}
{cmd:invcovarmat(}{it:invcovarmat}{cmd:)} specifies the name of a matrix
to be used in the computation described under {ul:Remarks}. It is presumably the
inverse covariance matrix of {it:varlist}, but the only requirement is that
it be a square {it:p}-by-{it:p} matrix, and both the row and column names
must equal the names in {it:varlist} in the same order as in {it:varlist}.
{p 4 4 2}
You can use {help covariancemat} to help construct the inverse covariance matrix;
it should be followed by a {cmd: mat} ... {cmd: = inv()} operation.
An example is given below, in the {ul:Examples} section.
See further discussion of the purpose of this option, under {ul:Remarks}.
{p 4 4 2}
{cmd:compute_invcovarmat} specifies that you want the inverse covariance
matrix to be computed, rather than passed in (via {cmd:invcovarmat()}).
This computation is subject to weighting, as well as limitation by
{it:treatedvar} if the {cmd:treated()} option is specified.
(But see {cmd:nocovtrlimitation}.)
Note that this will call {help covariancemat}, which computes covariances
limited to observations with all variables of {it:varlist} nonmissing.
(I.e., it is potentially different from the pairwise computation of covariances.)
{p 4 4 2}
{cmd:invcovarmat()} and {cmd:compute_invcovarmat} are alternatives; one of
them must be specified. If both are specified, then {cmd:compute_invcovarmat}
takes precedence.
{p 4 4 2}
{cmd:treated(}{it:treatedvar}{cmd:)}
specifies a numeric variable that
distinguishes the "treated" observations, with values of 0 and non-zero
signifying not-treated and treated, respectively. See {help mahapick} for an
explanation of the concept of the treated set.
This option affects only the actions of the {cmd:compute_invcovarmat} and
{cmd:refmeans} options; these computations are limited to the
set of observations for which {it:treatedvar} is non-zero, if {cmd:treated()}
is specified. See {cmd:nocovtrlimitation} and {cmd:nomeantrlimitation} for
how to control those limitations.
{p 4 4 2}
{cmd:euclidean} takes effect only if {cmd:compute_invcovarmat} is also specified.
It specifies that the off-diagonal elements of the covariance
matrix are to be replaced with zeroes, which yields the normalized Euclidean
distance measure. (This option applies only with {cmd:compute_invcovarmat}
because the zeroing of off-diagonal elements is done to the covariance
matrix {c -} i.e., prior to inversion.
If you prefer this measure and are providing the matrix via the {cmd:invcovarmat()}
option, you should zero-out the off-diagonal elements prior to inverting
{c -} or directly construct a matrix of reciprocal variances.
Note that if the diagonal elements of a matrix are c1, c2, ..., c{it:p}, and
all other elements are zero, then its inverse consists of 1/c1, 1/c2, ..., 1/c{it:p}
on the diagonal and zero elsewhere.)
See more about this under {ul:Remarks}.
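{p 4 4 2}
As a minimal sketch of the second approach, the following builds the matrix of
reciprocal variances by hand and passes it in. It assumes that
{help covariancemat} accepts the syntax shown under {ul:Examples}; the names
{cmd:M}, {cmd:D}, {cmd:DINV}, and {cmd:edist} are illustrative only.
The row and column names are reset explicitly, in case they are not carried
through the matrix operations.
{p 4 8 2}
{cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2}
{cmd:. covariancemat `vars', covarmat(M)}{p_end}
{p 4 8 2}
{cmd:. matrix D = diag(vecdiag(M))}{p_end}
{p 4 8 2}
{cmd:. matrix DINV = inv(D)}{p_end}
{p 4 8 2}
{cmd:. matrix rownames DINV = `vars'}{p_end}
{p 4 8 2}
{cmd:. matrix colnames DINV = `vars'}{p_end}
{p 4 8 2}
{cmd:. mahascore `vars', gen(edist) refobs(12) invcovarmat(DINV)}{p_end}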
{p 4 4 2}
{cmd:display(}{it:display_options}{cmd:)} turns on the display of certain
data structures used in the computation. If {it:display_options} contains
{cmd:covar}, then the covariance matrix is listed;
if it contains {cmd:invcov}, then the inverse covariance matrix is listed;
if it contains {cmd:means} and the {cmd:refmeans} option was specified, then the vector of means
is listed. Any other content is ignored.
{p 4 4 2}
If the inverse covariance matrix is displayed, it may be either
{it:invcovarmat} or that which is computed as directed by the
{cmd:compute_invcovarmat} option.
This may be useful in debugging or just to assure
you that the same set of (inverse) covariances are being used in repeated calls.
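{p 4 4 2}
For instance, a call along the following lines should list all three
structures. This is a sketch that assumes the keywords may be given together,
separated by spaces; the variable names are illustrative.
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(dist) refmeans compute_invcovarmat display(covar invcov means)}{p_end}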
{p 4 4 2}
{cmd:unsquared} modifies the results to be the unsquared values, that is, the
square roots of the default values.
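{p 4 4 2}
A quick way to confirm the relationship on your own data is sketched below.
The variable names {cmd:d2} and {cmd:d} are illustrative, and the tolerance is
an arbitrary choice; the check skips observations where either result is missing.
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(d2) refobs(12) compute_invcovarmat}{p_end}
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(d) refobs(12) compute_invcovarmat unsquared}{p_end}
{p 4 8 2}
{cmd:. assert reldif(d, sqrt(d2)) < 1e-6 if !missing(d, d2)}{p_end}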
{p 4 4 2}
{cmd:verbose} specifies that a line will be written, indicating some of the
options specified.
{p 4 4 2}
{cmd:nocovtrlimitation} specifies that the covariance computation
(for {cmd:compute_invcovarmat}) not be limited to treated observations.
{p 4 4 2}
{cmd:nomeantrlimitation} specifies that the mean computation
(for {cmd:refmeans}) not be limited to treated observations.
{p 4 4 2}
Specifying both {cmd:nocovtrlimitation} and {cmd:nomeantrlimitation}
is equivalent to not specifying {cmd:treated()}. Thus, it makes sense to use
only one of them, if any.
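{p 4 4 2}
As an illustration (variable names are hypothetical, borrowed from the
{ul:Examples} section), the following takes the reference means from the
treated observations only, while computing the covariances over all
observations:
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(dist) refmeans treated(assisted) compute_invcovarmat nocovtrlimitation}{p_end}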
{title:Remarks}
{p 4 4 2}
The (squared) distance measure generated is the matrix product d'Xd, where d is a vector
of differences in the set of variables, and X is either the inverse of the
covariance matrix of {it:varlist}, or is a specified matrix that is provided via
the {cmd:invcovarmat()} option.
{p 4 4 2}
The difference vector d is taken between each
observation and the tuple of reference values. That is,
d = (v1-{it:ref1} \ v2-{it:ref2} \ ... \ v{it:p}-{it:refp}), where v1 v2 ... v{it:p}
are the variables of {it:varlist}, and {it:ref1}, {it:ref2},... {it:refp}
are the reference values. In particular, under the {cmd:refobs(}{it:#}{cmd:)}
option, {it:ref1}=v1[{it:#}], {it:ref2}=v2[{it:#}], etc.
{p 4 4 2}
Thus, the generated value is the sum of all the possible products of
pairs of elements of d, weighted by corresponding elements of X.
This includes components that are the
squares of elements of d, weighted by the elements on the diagonal of X, plus
other products (of differing elements of d), weighted by the off-diagonal
elements of X.
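{p 4 4 2}
For two covariates, this sum can be written out by hand, which may help in
checking your understanding. The sketch below assumes that {help covariancemat}
accepts the syntax shown under {ul:Examples}; {cmd:dinc}, {cmd:dage},
{cmd:check}, and {cmd:d2} are illustrative names. If the computation is as
described above, {cmd:d2} and {cmd:check} should agree up to rounding.
{p 4 8 2}
{cmd:. local vars "income age"}{p_end}
{p 4 8 2}
{cmd:. covariancemat `vars', covarmat(M)}{p_end}
{p 4 8 2}
{cmd:. matrix MINV = inv(M)}{p_end}
{p 4 8 2}
{cmd:. mahascore `vars', gen(d2) refobs(12) invcovarmat(MINV)}{p_end}
{p 4 8 2}
{cmd:. generate double dinc = income - income[12]}{p_end}
{p 4 8 2}
{cmd:. generate double dage = age - age[12]}{p_end}
{p 4 8 2}
{cmd:. generate double check = dinc*dinc*MINV[1,1] + dinc*dage*MINV[1,2] + dage*dinc*MINV[2,1] + dage*dage*MINV[2,2]}{p_end}
{p 4 8 2}
{cmd:. list d2 check in 1/5}{p_end}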
{p 4 4 2}
Note that the generated value (for each observation) is a single number, though
technically it is a 1-by-1 matrix. It is expected to be >=0 if X is truly an
inverse covariance matrix, as such matrices are known to be positive semi-definite.
However, if X is an arbitrary matrix, then there is no guarantee that the
result will be nonnegative.
{p 4 4 2}
There are two purposes for the {cmd:invcovarmat()} option.
First, it can save unnecessary repeated calculations whenever
{cmd:mahascore} is repeatedly called on the same dataset {c -} which is
typically done as you step through a set of reference observations. Second,
you may want to compute the inverse covariance matrix in some way
not provided for. For example, you might compute the inverse covariance matrix
on some large set of observations, and then run {cmd:mahascore} on
a subset or several subsets {c -} but using this common set of covariances.
This latter situation occurs in {help mahapick} when using the
{cmd:sliceby()} option. (If it were not for this option
then the covariances would be recalculated on each subset {c -}
differently.)
{p 4 4 2}
The {cmd:refvals()} option is expected to be rarely used. Potentially, it may
save unnecessary repeated calculations {c -} analogous to one of the uses
of {cmd:invcovarmat()}.
Another use might be if you want the reference means and the inverse covariance
matrix to be computed differently in regard to how they are affected by
the {cmd:treated()} option or weights.
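{p 4 4 2}
A sketch of that second use follows: the reference means are taken over a
subgroup of your choosing, while the covariances are computed over all
observations. The variable {cmd:region} and the matrix name {cmd:V} are
hypothetical.
{p 4 8 2}
{cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2}
{cmd:. matrix V = J(3, 1, .)}{p_end}
{p 4 8 2}
{cmd:. local i = 1}{p_end}
{p 4 8 2}
{cmd:. foreach v of local vars {c -(}}{p_end}
{p 4 8 2}
{cmd:. quietly summarize `v' if region == 1}{p_end}
{p 4 8 2}
{cmd:. matrix V[`i', 1] = r(mean)}{p_end}
{p 4 8 2}
{cmd:. local ++i}{p_end}
{p 4 8 2}
{cmd:. {c )-}}{p_end}
{p 4 8 2}
{cmd:. matrix rownames V = `vars'}{p_end}
{p 4 8 2}
{cmd:. mahascore `vars', gen(dist) refvals(V) compute_invcovarmat}{p_end}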
{p 4 4 2}
The {cmd:refmeans} option can be useful in detecting multivariate
outliers: tuples of values that are judged to be outliers when all the
variables are considered together, but where the
values are not necessarily outliers when the variables are
considered separately.
See {browse "http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html"}
for an explanation of this phenomenon.
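{p 4 4 2}
A minimal sketch of such a search is given below; the variable names are
illustrative. Observations farthest from the centroid are listed first; what
counts as "far" is left to your judgment.
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(d2) refmeans compute_invcovarmat}{p_end}
{p 4 8 2}
{cmd:. gsort -d2}{p_end}
{p 4 8 2}
{cmd:. list income age numkids d2 in 1/10}{p_end}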
{p 4 4 2}
The {cmd:euclidean} option, combined with {cmd:compute_invcovarmat}, yields
the normalized Euclidean distance. It can be considered as a simplified version
of the true Mahalanobis measure, and is less thorough in that it ignores
correlations between different variables of {it:varlist}.
It suffers from the flaw that highly correlated variables can act together
as one variable but with disproportional weight. Another way to characterize
it is that it presumes that the data are configured in ellipsoids that are
oriented parallel to the axes. Also, it may fail to detect multivariate
outliers.
{p 4 4 2}
The normalized Euclidean measure is probably less desirable than the true
Mahalanobis measure; it is provided as
a comparison measure, and it replicates the behavior of the earlier
{cmd:mahascore} and {cmd:mahapick} programs. Some experimentation has shown
that, while the values of the two measures are different, they may often yield
orderings (i.e., if you {help sort} on these measures) that are similar.
Of course, this phenomenon may be highly data-dependent, and may vary especially
if highly correlated variables are present.
{col 12}{hline}
{p 12 12 12}
{hi:Technical note:}
The non-normalized Euclidean measure is not provided for by {cmd:mahascore},
but is available in {help matrix dissimilarity} (beginning with Stata 9).
It suffers from sensitivity to the scale of measurement; e.g., is income in
dollars or thousands of dollars? The normalized Euclidean measure is a first
step in improving this measure in that it corrects the
problem of measurement scale. The true Mahalanobis measure goes one step
further in that it accounts for correlation between variables.
{p_end}
{col 12}{hline}
{p 4 4 2}
If either of the following conditions occurs, then the resulting measure will be missing:
{p 8 8 2}
Any covariate (variable in {it:varlist}) is missing in either the reference
observation or the observation for which the measure is being calculated.
(Thus, if any covariate is missing in the reference observation, then the
result will be universally missing.)
{p 8 8 2}
Any of the inverse covariance elements are missing.
This would cause the result to be universally missing.
{p 4 4 2}
If the inverse covariance matrix is computed on a very small set of
observations, it may not be valid and may yield strange results. It
might fail to be positive semi-definite, and can yield negative measures.
(It may also cause the {cmd:unsquared} option to have a real effect on
comparisons and sortings of the results.)
{p 4 4 2}
{cmd:mahascore} computes a measure based on a single tuple of reference values:
the values in a specified reference observation, the means of {it:varlist},
or an explicit tuple of values. Thus, it generates a single variable.
In some situations (e.g., searching for multivariate outliers), that
may be all you need, but in other situations, you may want to obtain the distance
measures with respect to a multitude of reference observations, thus generating
what is logically a rectangular array of values.
(This is why there is a provision to pass in the inverse
covariance matrix, rather than recomputing the same matrix for each step.)
You may or may not want to
keep all these values; you may want to make use of the values for one
reference observation, discard them and go on to the next reference
observation.
Users who wish to do these sorts of operations should consider
{help mahascores} or {help mahapick}. {cmd:mahascores} stores all the values
from a multitude of reference values; {cmd:mahapick} selects several
observations deemed to be closest matches (lowest scores). (The latter is
an example of using the score values and then discarding them.)
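{p 4 4 2}
If you prefer to do this by hand, the use-and-discard pattern can be sketched
as follows. It assumes that an inverse covariance matrix {cmd:MINV} has already
been built as shown under {ul:Examples}; {cmd:d} is an illustrative name, and
{cmd:summarize} stands in for whatever use you make of the scores.
{p 4 8 2}
{cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2}
{cmd:. forvalues j = 1/15 {c -(}}{p_end}
{p 4 8 2}
{cmd:. mahascore `vars', gen(d) refobs(`j') invcovarmat(MINV)}{p_end}
{p 4 8 2}
{cmd:. summarize d}{p_end}
{p 4 8 2}
{cmd:. drop d}{p_end}
{p 4 8 2}
{cmd:. {c )-}}{p_end}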
{p 4 4 2}
It may help to understand two distinct types of weightings that can occur in
{cmd:mahascore}. Data weights, if specified,
affect the computation of the inverse covariance matrix if {cmd:compute_invcovarmat}
is specified, as well as the means calculation under {cmd:refmeans}.
Once this inverse covariance matrix has been established, it serves as a
set of weights for computing the distance measure.
The former weighting is observation-oriented; the latter is variable-oriented.
{title:Examples}
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(dist1) refobs(12) invcovarmat(`v')}
{p 4 8 2}
{cmd:. mahascore income age numkids, gen(dist2) refobs(`j')}
{cmd:treated(assisted) compute_invcov}
{p 4 4 2}
To create your own inverse covariance matrix:
{p 4 8 2}
{cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2}
{cmd:. covariancemat `vars' in 1/15, covarmat(M)}{p_end}
{p 4 8 2}
{cmd:. mat MINV = inv(M) // or possibly invsym(M)}{p_end}
{p 4 8 2}
{cmd:. forvalues j = 1/15 {c -(}}{p_end}
{p 4 8 2}
{cmd:. mahascore `vars', gen(dist`j') refobs(`j') invcovarmat(MINV)}{p_end}
{p 4 8 2}
{cmd:. {c )-}}
{p 4 4 2}
To create your own reference values:
{p 4 8 2}
{cmd:. local vars "income age numkids"}{p_end}
{p 4 8 2}
{cmd:. matrix V = (20000 \ 25 \ 2)}{p_end}
{p 4 8 2}
{cmd:. matrix rownames V = `vars'}{p_end}
{p 4 8 2}
{cmd:. mahascore `vars', gen(dist) refvals(V) compute}{p_end}
{title:Acknowledgement}
{p 4 4 2}
The author wishes to thank Joseph Harkness, formerly of The
Institute for Policy Studies
at Johns Hopkins University, for guidance in developing this program,
as well as Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung
GmbH, for suggesting further improvements.
{title:Author}
{p 4 4 2}
David Kantor; initial development was done at The Institute for Policy Studies,
Johns Hopkins University.
Email {browse "mailto:kantor.d@att.net":kantor.d@att.net} if you observe any
problems.
{title:Also See}
{p 4 4 2}
{help mahapick}, {help mahascores}, {help mahascore2}, {help covariancemat}, {help variancemat},
{help screenmatches}, {help stackids}.