-------------------------------------------------------------------------------
help for mahascore
-------------------------------------------------------------------------------

Generate a Mahalanobis distance measure

mahascore varlist [weight], gen(newvar) [refobs(#) refvals(refvalsmat) refmeans
    treated(treatedvar) invcovarmat(invcovarmat) compute_invcovarmat unsquared
    euclidean display(display_options) verbose float nocovtrlimitation
    nomeantrlimitation]

Description

mahascore generates a (squared, by default) Mahalanobis distance measure between every observation and a single tuple of reference values, which can be one of the following:

- the tuple of values in a specified reference observation, using the refobs() option;
- a tuple of values passed in, using the refvals() option;
- the means of the variables of varlist, using the refmeans option.

mahascore is used by mahascores and mahapick, but may be used independently as well.

varlist (the "covariates") is a list of numeric variables on which to build the distance measure. These variables should be of numeric significance, not categorical; any categorical variables should be replaced by a set of indicator variables.

Weights are allowed, but apply only under the compute_invcovarmat and refmeans options.

By default, the result is actually the square of the Mahalanobis distance measure. You can use the unsquared option to obtain the proper unsquared value. But note that in most usages, the resulting values are used in comparisons or sortings; the proportional magnitude is not significant, so the squared values are just as good.

--------------------------------------------------------------------
Technical note: As of 26mar2008, mahascore is revised to produce a true Mahalanobis measure; previously, it produced the normalized Euclidean measure. See the euclidean option for further explanation.
--------------------------------------------------------------------

Options

In what follows, let p denote the number of variables in varlist.

gen(newvar) is required; it specifies the new variable which will contain the generated distance measure. Its default type is double.

float specifies that the type of newvar will be float, rather than double.

refobs(#) specifies an integer in the range 1 to _N, indicating the reference observation. For example, if # = 12, then the generated measure will be calculated between each observation and observation 12.

refvals(refvalsmat) enables you to pass in a tuple of values to use as the comparison values; i.e., the distances will be measured between each observation and this tuple. refvalsmat must be a column vector (a p-by-1 matrix) whose entries correspond to the variables in varlist, and whose row names equal the names in varlist, in the same order. An example of how to do this is given below.

refmeans specifies that the tuple of reference values will be the means of the variables of varlist; this is often referred to as the centroid of varlist. Note that the means are computed subject to weighting, as well as limitation by treatedvar if the treated() option is specified. (But see nomeantrlimitation.) Also, see the discussion of multivariate outliers in the Remarks section.

refobs(), refvals(), and refmeans are alternatives; one of them must be specified.

invcovarmat(invcovarmat) specifies the name of a matrix to be used in the computation described under Remarks. It is presumably the inverse covariance matrix of varlist, but the only requirement is that it be a square p-by-p matrix, and both the row and column names must equal the names in varlist, in the same order as in varlist.

You can use covariancemat to help construct the inverse covariance matrix; it should be followed by a mat ... = inv() operation. An example is given below, in the Examples section. See further discussion of the purpose of this option under Remarks.

compute_invcovarmat specifies that you want the inverse covariance matrix to be computed, rather than passed in (via invcovarmat()). This computation is subject to weighting, as well as limitation by treatedvar if the treated() option is specified. (But see nocovtrlimitation.) Note that this will call covariancemat, which computes covariances limited to observations with all variables of varlist nonmissing. (I.e., it is potentially different from the pairwise computation of covariances.)
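To make the listwise-complete point concrete, here is a minimal sketch in plain Python (mahascore and covariancemat are Stata programs; the data values and variable layout below are made up for illustration): the covariance is computed only over rows where every variable is nonmissing.

```python
# Illustrative sketch of listwise-complete covariance: rows with any
# missing value (None stands in for Stata's missing) are dropped before
# computing, which can differ from a pairwise computation.

rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, 8.0)]

complete = [r for r in rows if None not in r]  # listwise deletion
n = len(complete)
xs = [r[0] for r in complete]
ys = [r[1] for r in complete]
mx, my = sum(xs) / n, sum(ys) / n
# Sample covariance over the complete rows only.
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
print(n)  # 3 complete rows used, not 4
```

Here only three of the four rows enter the computation, so the resulting covariance can differ from one computed pairwise over all nonmissing pairs.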

invcovarmat() and compute_invcovarmat are alternatives; one of them must be specified. If both are specified, then compute_invcovarmat takes precedence.

treated(treatedvar) specifies a numeric variable that distinguishes the "treated" observations, with values of 0 and non-zero signifying not-treated and treated, respectively. See mahapick for an explanation of the concept of the treated set. This option affects only the actions of the compute_invcovarmat and refmeans options; these computations are limited to the set of observations for which treatedvar is non-zero, if treated() is specified. See nocovtrlimitation and nomeantrlimitation for how to control those limitations.

euclidean takes effect only if compute_invcovarmat is also specified. It specifies that the off-diagonal elements of the covariance matrix are to be replaced with zeroes, which yields the normalized Euclidean distance measure. (This option applies only with compute_invcovarmat because the zeroing of off-diagonal elements is done to the covariance matrix - i.e., prior to inversion. If you prefer this measure and are providing the matrix via the invcovarmat() option, you should zero out the off-diagonal elements prior to inverting - or directly construct a matrix of reciprocal variances. Note that if the diagonal elements of a matrix are c1, c2, ..., cp, and all other elements are zero, then its inverse consists of 1/c1, 1/c2, ..., 1/cp on the diagonal and zero elsewhere.) See more about this under Remarks.
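As a quick check of the reciprocal-variance remark, the following plain-Python sketch (illustrative only; not part of mahascore) computes the normalized squared Euclidean distance directly as the sum of squared differences, each divided by the corresponding variance c_i:

```python
# With a diagonal covariance matrix diag(c1, ..., cp), the inverse is
# diag(1/c1, ..., 1/cp), so d'Xd collapses to sum((vi - refi)^2 / ci):
# the normalized (squared) Euclidean distance.

def normalized_euclidean_sq(obs, ref, variances):
    """Squared normalized Euclidean distance between one observation
    and a tuple of reference values."""
    return sum((v - r) ** 2 / c for v, r, c in zip(obs, ref, variances))

# With unit variances this reduces to the plain squared Euclidean distance.
print(normalized_euclidean_sq([3.0, 4.0], [0.0, 0.0], [1.0, 1.0]))  # 25.0
```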

display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar, then the covariance matrix is listed; if it contains invcov, then the inverse covariance matrix is listed; if it contains means and the refmeans option was specified, then the vector of means is listed. Any other content is ignored.

If the inverse covariance matrix is displayed, it may be either invcovarmat or that which is computed as directed by the compute_invcovarmat option. This may be useful in debugging, or just to assure you that the same set of (inverse) covariances is being used in repeated calls.

unsquared modifies the results to be the unsquared values, that is, the square roots of the default values.

verbose specifies that a line will be written, indicating some of the options specified.

nocovtrlimitation specifies that the covariance computation (for compute_invcovarmat) not be limited to treated observations.

nomeantrlimitation specifies that the mean computation (for refmeans) not be limited to treated observations.

Specifying both nocovtrlimitation and nomeantrlimitation is equivalent to not specifying treated(). Thus, it makes sense to use only one of them, if any.

Remarks

The (squared) distance measure generated is the matrix product d'Xd, where d is a vector of differences in the set of variables, and X is either the inverse of the covariance matrix of varlist, or a specified matrix provided via the invcovarmat() option.

The difference vector d is taken between each observation and the tuple of reference values. That is, d = (v1-ref1 \ v2-ref2 \ ... \ vp-refp), where v1, v2, ..., vp are the variables of varlist, and ref1, ref2, ..., refp are the reference values. In particular, under the refobs(#) option, ref1 = v1[#], ref2 = v2[#], etc.

Thus, the generated value is the sum of all the possible products of pairs of elements of d, weighted by corresponding elements of X. This includes components that are the squares of elements of d, weighted by the elements on the diagonal of X, plus other products (of differing elements of d), weighted by the off-diagonal elements of X.

Note that the generated value (for each observation) is a single number, though technically it is a 1-by-1 matrix. It is expected to be >= 0 if X is truly an inverse covariance matrix, as such matrices are known to be positive semi-definite. However, if X is an arbitrary matrix, then there is no guarantee that the result will be nonnegative.
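For readers who want to see the arithmetic spelled out, here is a minimal plain-Python sketch of the d'Xd computation for a single observation (mahascore itself is a Stata program; the function name, matrix, and values below are illustrative only):

```python
# Squared measure d'Xd for one observation, where X is a p-by-p matrix
# (presumably an inverse covariance matrix) and d is the vector of
# differences from the reference tuple.

def mahalanobis_sq(obs, ref, X):
    """Return d'Xd where d[i] = obs[i] - ref[i]."""
    d = [v - r for v, r in zip(obs, ref)]
    # Sum over all pairs (i, j): d[i] * X[i][j] * d[j].  Diagonal terms
    # are squares of elements of d; off-diagonal terms are cross
    # products of differing elements of d.
    return sum(d[i] * X[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

# With X = identity (uncorrelated, unit-variance data), the result is
# the plain squared Euclidean distance.
X = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis_sq([3.0, 4.0], [0.0, 0.0], X))  # 25.0
```

Note that a nonzero off-diagonal element of X contributes a cross-product term, which is exactly what the normalized Euclidean measure (euclidean option) discards.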

There are two purposes for the invcovarmat() option. First, it can save unnecessary repeated calculations whenever mahascore is repeatedly called on the same dataset - which is typically done as you step through a set of reference observations. Secondly, you may want to compute the inverse covariance matrix in some way not provided for. For example, you might compute the inverse covariance matrix on some large set of observations, and then run mahascore on a subset or several subsets - but using this common set of covariances. This latter situation occurs in mahapick when using the sliceby() option. (If it were not for this option, then the covariances would be recalculated on each subset - differently.)

The refvals() option is expected to be rarely used. Potentially, it may save unnecessary repeated calculations - analogous to one of the uses of invcovarmat(). Another use might be if you want the reference means and the inverse covariance matrix to be computed differently in regard to how they are affected by the treated() option or weights.

The refmeans option can be useful in detecting multivariate outliers: tuples of values that are judged to be outliers when all the variables are considered together, but where the values are not necessarily outliers when the variables are considered separately. See http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html for an explanation of this phenomenon.

The euclidean option, combined with compute_invcovarmat, yields the normalized Euclidean distance. It can be considered a simplified version of the true Mahalanobis measure, and is less thorough in that it ignores correlations between different variables of varlist. It suffers from the flaw that highly correlated variables can act together as one variable, but with disproportional weight. Another way to characterize it is that it presumes that the data are configured in ellipsoids that are oriented parallel to the axes. Also, it may fail to detect multivariate outliers.

The normalized Euclidean measure is probably less desirable than the true Mahalanobis measure; it is provided as a comparison measure, and it replicates the behavior of the earlier mahascore and mahapick programs. Some experimentation has shown that, while the values of the two measures are different, they may often yield orderings (i.e., if you sort on these measures) that are similar. Of course, this phenomenon may be highly data-dependent, and may vary especially if highly correlated variables are present.

--------------------------------------------------------------------
Technical note: The non-normalized Euclidean measure is not provided for by mahascore, but is available in matrix dissimilarity (beginning with Stata 9). It suffers from sensitivity to the scale of measurement; e.g., is income in dollars or thousands of dollars? The normalized Euclidean measure is a first step in improving this measure, in that it corrects the problem of measurement scale. The true Mahalanobis measure goes one step further, in that it accounts for correlation between variables.
--------------------------------------------------------------------
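The measurement-scale point can be made concrete with a small plain-Python sketch (the income and age figures are made up for illustration): rescaling a variable changes the plain Euclidean measure, but not the normalized one, because each squared difference is divided by that variable's variance, which rescales along with the data.

```python
# Plain vs. normalized squared Euclidean distance under a change of
# measurement scale (income in dollars vs. thousands of dollars).

def euclidean_sq(obs, ref):
    return sum((v - r) ** 2 for v, r in zip(obs, ref))

def normalized_euclidean_sq(obs, ref, variances):
    return sum((v - r) ** 2 / c for v, r, c in zip(obs, ref, variances))

# Plain Euclidean: the income term dominates or vanishes depending on units.
d_dollars = euclidean_sq([30000, 40], [20000, 30])
d_thousands = euclidean_sq([30, 40], [20, 30])
print(d_dollars == d_thousands)  # False: scale changes the measure

# Normalized: the income variance rescales by the same factor (1000^2),
# so the measure is unchanged.
n_dollars = normalized_euclidean_sq([30000, 40], [20000, 30], [4e6, 25])
n_thousands = normalized_euclidean_sq([30, 40], [20, 30], [4.0, 25])
print(n_dollars == n_thousands)  # True
```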

If any of these conditions occur, then the resulting measure will be missing:

- Any covariate (variable in varlist) is missing in either the reference observation or the observation for which the measure is being calculated. (Thus, if any covariate is missing in the reference observation, then the result will be universally missing.)
- Any of the inverse covariance elements are missing. This would cause the result to be universally missing.

If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures. (It may also cause the unsquared option to have a real effect on comparisons and sortings of the results.)

mahascore computes a measure based on a single tuple of reference values: the values in a specified reference observation, the means of varlist, or an explicit tuple of values. Thus, it generates a single variable. In some situations (e.g., searching for multivariate outliers), that may be all you need, but in other situations, you may want to obtain the distance measures with respect to a multitude of reference observations, thus generating what is logically a rectangular array of values. (This is why there is a provision to pass in the inverse covariance matrix, rather than recomputing the same matrix for each step.) You may or may not want to keep all these values; you may want to make use of the values for one reference observation, discard them, and go on to the next reference observation. Users who wish to do these sorts of operations should consider mahascores or mahapick. mahascores stores all the values from a multitude of reference values; mahapick selects several observations deemed to be closest matches (lowest scores). (The latter is an example of using the score values and then discarding them.)

It may help to understand two distinct types of weightings that can occur in mahascore. Data weights, if specified, affect the computation of the inverse covariance matrix if compute_invcovarmat is specified, as well as the means calculation under refmeans. Once this inverse covariance matrix has been established, it serves as a set of weights for computing the distance measure. The former weighting is observation-oriented; the latter is variable-oriented.

Examples

. mahascore income age numkids, gen(dist1) refobs(12) invcovarmat(`v')

. mahascore income age numkids, gen(dist2) refobs(`j') treated(assisted) compute_invcov

To create your own inverse covariance matrix:

. local vars "income age numkids"
. covariancemat `vars' in 1/15, covarmat(M)
. mat MINV = inv(M)  // or possibly invsym(M)
. forvalues j = 1/15 {
.   mahascore `vars', gen(dist`j') refobs(`j') invcovarmat(MINV)
. }

To create your own reference values:

. local vars "income age numkids"
. matrix V = (20000 \ 25 \ 2)
. matrix rownames V = `vars'
. mahascore `vars', gen(dist) refvals(V) compute

Acknowledgement

The author wishes to thank Joseph Harkness, formerly of The Institute for Policy Studies at Johns Hopkins University, for guidance in developing this program, as well as Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung GmbH, for suggesting further improvements.

Author

David Kantor; initial development was done at The Institute for Policy Studies, Johns Hopkins University. Email kantor.d@att.net if you observe any problems.

Also See

mahapick, mahascores, mahascore2, covariancemat, variancemat