-------------------------------------------------------------------------------
help for mahascore
-------------------------------------------------------------------------------

Generate a Mahalanobis distance measure

mahascore varlist [weight], gen(newvar) [refobs(#) refvals(refvalsmat) refmeans
    treated(treatedvar) invcovarmat(invcovarmat) compute_invcovarmat unsquared
    euclidean display(display_options) verbose float nocovtrlimitation
    nomeantrlimitation]

Description

mahascore generates a (squared, by default) Mahalanobis distance measure between every observation and a single tuple of reference values, which can be one of the following:

- the tuple of values in a specified reference observation, using the refobs() option;
- a tuple of values passed in, using the refvals() option;
- the means of the variables of varlist, using the refmeans option.

mahascore is used by mahascores and mahapick, but may be used independently as well.

varlist (the "covariates") is a list of numeric variables on which to build the distance measure. These variables should be of numeric significance, not categorical; any categorical variables should be replaced by a set of indicator variables.

Weights are allowed, but apply only under the compute_invcovarmat and refmeans options.

By default, the result is actually the square of the Mahalanobis distance measure. You can use the unsquared option to obtain the proper unsquared value. But note that in most usages, the resulting values are used in comparisons or sortings; the proportional magnitude is not significant, so the squared values are just as good.

--------------------------------------------------------------------
Technical note: As of 26mar2008, mahascore is revised to produce a true Mahalanobis measure; previously, it produced the normalized Euclidean measure. See the euclidean option for further explanation.
--------------------------------------------------------------------

Options

In what follows, let p denote the number of variables in varlist.

gen(newvar) is required; it specifies the new variable which will contain the generated distance measure. Its default type is double.

float specifies that the type of newvar will be float, rather than double.

refobs(#) specifies an integer in the range 1 to _N, indicating the reference observation. For example, if # = 12, then the generated measure will be calculated between each observation and observation 12.

refvals(refvalsmat) enables you to pass in a tuple of values to use as the comparison values; i.e., the distances will be measured between each observation and this tuple. refvalsmat must be a column vector (a p-by-1 matrix) whose entries correspond to the variables in varlist, and whose row names equal the names in varlist, in the same order. An example of how to do this is given below.

refmeans specifies that the tuple of reference values will be the means of the variables of varlist; this is often referred to as the centroid of varlist. Note that the means are computed subject to weighting, as well as limitation by treatedvar if the treated() option is specified. (But see nomeantrlimitation.) Also, see the discussion of multivariate outliers in the Remarks section.

refobs(), refvals(), and refmeans are alternatives; one of them must be specified.

invcovarmat(invcovarmat) specifies the name of a matrix to be used in the computation described under Remarks. It is presumably the inverse covariance matrix of varlist, but the only requirement is that it be a square p-by-p matrix, and both the row and column names must equal the names in varlist, in the same order as in varlist.

You can use covariancemat to help construct the inverse covariance matrix; it should be followed by a mat ... = inv() operation. An example is given below, in the Examples section. See further discussion of the purpose of this option under Remarks.

compute_invcovarmat specifies that you want the inverse covariance matrix to be computed, rather than passed in (via invcovarmat()). This computation is subject to weighting, as well as limitation by treatedvar if the treated() option is specified. (But see nocovtrlimitation.) Note that this will call covariancemat, which computes covariances limited to observations with all variables of varlist nonmissing. (I.e., it is potentially different from the pairwise computation of covariances.)
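To make the listwise-complete point concrete, here is a minimal sketch in plain Python (mahascore and covariancemat are Stata programs; the data values and variable layout below are made up for illustration): the covariance is computed only over rows where every variable is nonmissing.

```python
# Illustrative sketch of listwise-complete covariance: rows with any
# missing value (None stands in for Stata's missing) are dropped before
# computing, which can differ from a pairwise computation.

rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, 8.0)]

complete = [r for r in rows if None not in r]  # listwise deletion
n = len(complete)
xs = [r[0] for r in complete]
ys = [r[1] for r in complete]
mx, my = sum(xs) / n, sum(ys) / n
# Sample covariance over the complete rows only.
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
print(n)  # 3 complete rows used, not 4
```

Here only three of the four rows enter the computation, so the resulting covariance can differ from one computed pairwise over all nonmissing pairs.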

invcovarmat() and compute_invcovarmat are alternatives; one of them must be specified. If both are specified, then compute_invcovarmat takes precedence.

treated(treatedvar) specifies a numeric variable that distinguishes the "treated" observations, with values of 0 and non-zero signifying not-treated and treated, respectively. See mahapick for an explanation of the concept of the treated set. This option affects only the actions of the compute_invcovarmat and refmeans options; these computations are limited to the set of observations for which treatedvar is non-zero, if treated() is specified. See nocovtrlimitation and nomeantrlimitation for how to control those limitations.

euclidean takes effect only if compute_invcovarmat is also specified. It specifies that the off-diagonal elements of the covariance matrix are to be replaced with zeroes, which yields the normalized Euclidean distance measure. (This option applies only with compute_invcovarmat because the zeroing of off-diagonal elements is done to the covariance matrix - i.e., prior to inversion. If you prefer this measure and are providing the matrix via the invcovarmat() option, you should zero out the off-diagonal elements prior to inverting - or directly construct a matrix of reciprocal variances. Note that if the diagonal elements of a matrix are c1, c2, ..., cp, and all other elements are zero, then its inverse consists of 1/c1, 1/c2, ..., 1/cp on the diagonal and zero elsewhere.) See more about this under Remarks.
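As a quick check of the reciprocal-variance remark, the following plain-Python sketch (illustrative only; not part of mahascore) computes the normalized squared Euclidean distance directly as the sum of squared differences, each divided by the corresponding variance c_i:

```python
# With a diagonal covariance matrix diag(c1, ..., cp), the inverse is
# diag(1/c1, ..., 1/cp), so d'Xd collapses to sum((vi - refi)^2 / ci):
# the normalized (squared) Euclidean distance.

def normalized_euclidean_sq(obs, ref, variances):
    """Squared normalized Euclidean distance between one observation
    and a tuple of reference values."""
    return sum((v - r) ** 2 / c for v, r, c in zip(obs, ref, variances))

# With unit variances this reduces to the plain squared Euclidean distance.
print(normalized_euclidean_sq([3.0, 4.0], [0.0, 0.0], [1.0, 1.0]))  # 25.0
```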

display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar, then the covariance matrix is listed; if it contains invcov, then the inverse covariance matrix is listed; if it contains means and the refmeans option was specified, then the vector of means is listed. Any other content is ignored.

If the inverse covariance matrix is displayed, it may be either invcovarmat or that which is computed as directed by the compute_invcovarmat option. This may be useful in debugging, or just to assure you that the same set of (inverse) covariances is being used in repeated calls.

unsquared modifies the results to be the unsquared values, that is, the square roots of the default values.

verbose specifies that a line will be written, indicating some of the options specified.

nocovtrlimitation specifies that the covariance computation (for compute_invcovarmat) not be limited to treated observations.

nomeantrlimitation specifies that the mean computation (for refmeans) not be limited to treated observations.

Specifying both nocovtrlimitation and nomeantrlimitation is equivalent to not specifying treated(). Thus, it makes sense to use only one of them, if any.

Remarks

The (squared) distance measure generated is the matrix product d'Xd, where d is a vector of differences in the set of variables, and X is either the inverse of the covariance matrix of varlist, or a specified matrix provided via the invcovarmat() option.

The difference vector d is taken between each observation and the tuple of reference values. That is, d = (v1-ref1 \ v2-ref2 \ ... \ vp-refp), where v1, v2, ..., vp are the variables of varlist, and ref1, ref2, ..., refp are the reference values. In particular, under the refobs(#) option, ref1 = v1[#], ref2 = v2[#], etc.

Thus, the generated value is the sum of all the possible products of pairs of elements of d, weighted by corresponding elements of X. This includes components that are the squares of elements of d, weighted by the elements on the diagonal of X, plus other products (of differing elements of d), weighted by the off-diagonal elements of X.

Note that the generated value (for each observation) is a single number, though technically it is a 1-by-1 matrix. It is expected to be >= 0 if X is truly an inverse covariance matrix, as such matrices are known to be positive semi-definite. However, if X is an arbitrary matrix, then there is no guarantee that the result will be nonnegative.
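For readers who want to see the arithmetic spelled out, here is a minimal plain-Python sketch of the d'Xd computation for a single observation (mahascore itself is a Stata program; the function name, matrix, and values below are illustrative only):

```python
# Squared measure d'Xd for one observation, where X is a p-by-p matrix
# (presumably an inverse covariance matrix) and d is the vector of
# differences from the reference tuple.

def mahalanobis_sq(obs, ref, X):
    """Return d'Xd where d[i] = obs[i] - ref[i]."""
    d = [v - r for v, r in zip(obs, ref)]
    # Sum over all pairs (i, j): d[i] * X[i][j] * d[j].  Diagonal terms
    # are squares of elements of d; off-diagonal terms are cross
    # products of differing elements of d.
    return sum(d[i] * X[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

# With X = identity (uncorrelated, unit-variance data), the result is
# the plain squared Euclidean distance.
X = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis_sq([3.0, 4.0], [0.0, 0.0], X))  # 25.0
```

Note that a nonzero off-diagonal element of X contributes a cross-product term, which is exactly what the normalized Euclidean measure (euclidean option) discards.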

There are two purposes for the invcovarmat() option. First, it can save unnecessary repeated calculations whenever mahascore is repeatedly called on the same dataset - which is typically done as you step through a set of reference observations. Secondly, you may want to compute the inverse covariance matrix in some way not provided for. For example, you might compute the inverse covariance matrix on some large set of observations, and then run mahascore on a subset or several subsets - but using this common set of covariances. This latter situation occurs in mahapick when using the sliceby() option. (If it were not for this option, then the covariances would be recalculated on each subset - differently.)

The refvals() option is expected to be rarely used. Potentially, it may save unnecessary repeated calculations - analogous to one of the uses of invcovarmat(). Another use might be if you want the reference means and the inverse covariance matrix to be computed differently in regard to how they are affected by the treated() option or weights.

The refmeans option can be useful in detecting multivariate outliers: tuples of values that are judged to be outliers when all the variables are considered together, but where the values are not necessarily outliers when the variables are considered separately. See http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html for an explanation of this phenomenon.

The euclidean option, combined with compute_invcovarmat, yields the normalized Euclidean distance. It can be considered a simplified version of the true Mahalanobis measure, and is less thorough in that it ignores correlations between different variables of varlist. It suffers from the flaw that highly correlated variables can act together as one variable, but with disproportional weight. Another way to characterize it is that it presumes that the data are configured in ellipsoids that are oriented parallel to the axes. Also, it may fail to detect multivariate outliers.

The normalized Euclidean measure is probably less desirable than the true Mahalanobis measure; it is provided as a comparison measure, and it replicates the behavior of the earlier mahascore and mahapick programs. Some experimentation has shown that, while the values of the two measures are different, they may often yield orderings (i.e., if you sort on these measures) that are similar. Of course, this phenomenon may be highly data-dependent, and may vary especially if highly correlated variables are present.

--------------------------------------------------------------------
Technical note: The non-normalized Euclidean measure is not provided for by mahascore, but is available in matrix dissimilarity (beginning with Stata 9). It suffers from sensitivity to the scale of measurement; e.g., is income in dollars or thousands of dollars? The normalized Euclidean measure is a first step in improving this measure, in that it corrects the problem of measurement scale. The true Mahalanobis measure goes one step further, in that it accounts for correlation between variables.
--------------------------------------------------------------------
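The measurement-scale point can be made concrete with a small plain-Python sketch (the income and age figures are made up for illustration): rescaling a variable changes the plain Euclidean measure, but not the normalized one, because each squared difference is divided by that variable's variance, which rescales along with the data.

```python
# Plain vs. normalized squared Euclidean distance under a change of
# measurement scale (income in dollars vs. thousands of dollars).

def euclidean_sq(obs, ref):
    return sum((v - r) ** 2 for v, r in zip(obs, ref))

def normalized_euclidean_sq(obs, ref, variances):
    return sum((v - r) ** 2 / c for v, r, c in zip(obs, ref, variances))

# Plain Euclidean: the income term dominates or vanishes depending on units.
d_dollars = euclidean_sq([30000, 40], [20000, 30])
d_thousands = euclidean_sq([30, 40], [20, 30])
print(d_dollars == d_thousands)  # False: scale changes the measure

# Normalized: the income variance rescales by the same factor (1000^2),
# so the measure is unchanged.
n_dollars = normalized_euclidean_sq([30000, 40], [20000, 30], [4e6, 25])
n_thousands = normalized_euclidean_sq([30, 40], [20, 30], [4.0, 25])
print(n_dollars == n_thousands)  # True
```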

If any of these conditions occur, then the resulting measure will be missing:

- Any covariate (variable in varlist) is missing in either the reference observation or the observation for which the measure is being calculated. (Thus, if any covariate is missing in the reference observation, then the result will be universally missing.)
- Any of the inverse covariance elements are missing. This would cause the result to be universally missing.

If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures. (It may also cause the unsquared option to have a real effect on comparisons and sortings of the results.)

mahascore computes a measure based on a single tuple of reference values: the values in a specified reference observation, the means of varlist, or an explicit tuple of values. Thus, it generates a single variable. In some situations (e.g., searching for multivariate outliers), that may be all you need, but in other situations, you may want to obtain the distance measures with respect to a multitude of reference observations, thus generating what is logically a rectangular array of values. (This is why there is a provision to pass in the inverse covariance matrix, rather than recomputing the same matrix for each step.) You may or may not want to keep all these values; you may want to make use of the values for one reference observation, discard them, and go on to the next reference observation. Users who wish to do these sorts of operations should consider mahascores or mahapick. mahascores stores all the values from a multitude of reference values; mahapick selects several observations deemed to be closest matches (lowest scores). (The latter is an example of using the score values and then discarding them.)

It may help to understand two distinct types of weightings that can occur in mahascore. Data weights, if specified, affect the computation of the inverse covariance matrix if compute_invcovarmat is specified, as well as the means calculation under refmeans. Once this inverse covariance matrix has been established, it serves as a set of weights for computing the distance measure. The former weighting is observation-oriented; the latter is variable-oriented.

Examples

. mahascore income age numkids, gen(dist1) refobs(12) invcovarmat(`v')

. mahascore income age numkids, gen(dist2) refobs(`j') treated(assisted) compute_invcov

To create your own inverse covariance matrix:

. local vars "income age numkids"
. covariancemat `vars' in 1/15, covarmat(M)
. mat MINV = inv(M)  // or possibly invsym(M)
. forvalues j = 1/15 {
.   mahascore `vars', gen(dist`j') refobs(`j') invcovarmat(MINV)
. }

To create your own reference values:

. local vars "income age numkids"
. matrix V = (20000 \ 25 \ 2)
. matrix rownames V = `vars'
. mahascore `vars', gen(dist) refvals(V) compute

Acknowledgement

The author wishes to thank Joseph Harkness, formerly of The Institute for Policy Studies at Johns Hopkins University, for guidance in developing this program, as well as Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung GmbH, for suggesting further improvements.

Author

David Kantor; initial development was done at The Institute for Policy Studies, Johns Hopkins University. Email kantor.d@att.net if you observe any problems.

Also See

mahapick, mahascores, mahascore2, covariancemat, variancemat