-------------------------------------------------------------------------------
help for mahascores
-------------------------------------------------------------------------------

Generate a set of Mahalanobis distance measures

mahascores varlist [weight], [idvar(idvar) varprefix(varprefix) genmat(genmat) genfile(filename) name1(name1) name2(name2) scorevar(scorevar) replace treated(treatedvar) invcovarmat(invcovarmat) compute_invcovarmat display(display_options) full all unsquared euclidean verbose float transpose nocovtrlimitation]

Description

mahascores generates a Mahalanobis distance measure between every pair of observations, or between selected pairs of observations (under the treated() option). By default, the result is the square of the proper Mahalanobis measure. The unsquared option gives the unsquared value instead, but in most cases the resulting values are used only in comparisons or sortings; the square root preserves ordering, so the squared values serve just as well.
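The point about ordering can be illustrated outside Stata. The following Python sketch is not part of mahascores; it uses the standard definition of the squared Mahalanobis measure, (x-y)' S^-1 (x-y), with a made-up inverse covariance matrix and made-up points, and shows that sorting by the squared values and by their square roots gives the same order.

```python
import math

def maha_sq(x, y, inv_cov):
    """Squared Mahalanobis measure (x-y)' * inv_cov * (x-y)."""
    d = [a - b for a, b in zip(x, y)]
    p = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(p) for j in range(p))

# Hypothetical inverse covariance matrix and points (illustration only).
inv_cov = [[2.0, -0.5],
           [-0.5, 1.0]]
ref = (0.0, 0.0)
others = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (2.0, 1.0)}

# Default-style (squared) values and unsquared-style values.
squared = {k: maha_sq(ref, v, inv_cov) for k, v in others.items()}
unsquared = {k: math.sqrt(s) for k, s in squared.items()}

# Both rankings agree because the square root is monotone increasing.
order_sq = sorted(squared, key=squared.get)
order_unsq = sorted(unsquared, key=unsquared.get)
print(order_sq, order_unsq)
```

The agreement relies on the squared values being non-negative, which holds whenever the inverse covariance matrix is positive semi-definite.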

varlist (the "covariates") is a list of numeric variables on which to build the distance measure. These variables should be of numeric significance, not categorical; any categorical variables should be replaced by a set of indicator variables.

See mahascore for an explanation of the Mahalanobis measure.

Weights are allowed, but apply only under the compute_invcovarmat option.

There are three means of getting the output:
- a set of generated variables, using the varprefix() option;
- a matrix, using the genmat() option;
- a separate file, using the genfile() option.

Options

In what follows, let p denote the number of variables in varlist.

idvar(idvar) is an identifying variable which is used to mark the components of the output:
- with the varprefix() option, its values become part of the new variable names;
- with the genmat() option, its values are used as matrix row and column names;
- with the genfile() option, its values go into the primary and secondary identifying variables.

idvar can be of any type, but it must be a single variable. If the existing identifying scheme consists of multiple variables, you should devise a way to combine them uniquely into a single variable. Numbers are acceptable, but they should be integers.

It is desirable and often essential (depending on the output options) that idvar uniquely identify observations. Under the varprefix() option, it is essential that the contents of idvar be acceptable as suffixes on variable names; avoid embedded spaces and characters that are not acceptable in variable names.

--------------------------------------------------------------------

technical notes:

With the varprefix() option, illegal characters, embedded spaces, or non-unique values may result in a fatal error.

With the genmat() option, embedded spaces may cause the row or column names to "slip over" to the wrong row or column. Non-unique values do not cause an immediate error, but may cause confusing labelling of the columns and rows and may cause errors in later use of the matrix.

With the genfile() option, the form of the values in idvar is not critical, but if they don't uniquely identify observations, then it will be difficult to use the resulting file.

--------------------------------------------------------------------
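As an illustration of the concerns in these notes, the following Python sketch (not part of mahascores) checks a list of candidate idvar values before use with varprefix(). The function name is hypothetical, and the naming rule it applies is an assumption: the classic ASCII rule of 1 to 32 letters, digits, or underscores, not starting with a digit. It ignores reserved words and any Unicode names that newer Stata versions accept, so treat it as a conservative pre-check only.

```python
import re

# Assumed rule: ASCII letters, digits, underscore; at most 32 characters;
# first character is not a digit. Reserved words are not checked.
NAME_RULE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")

def check_idvar_values(values, varprefix):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    seen = set()
    for v in values:
        name = f"{varprefix}{v}"
        if not NAME_RULE.fullmatch(name):
            problems.append(f"invalid name: {name!r}")
        if v in seen:
            problems.append(f"duplicate id: {v!r}")
        seen.add(v)
    return problems

print(check_idvar_values([101, 102], "d1_"))
print(check_idvar_values(["a b"], "d1_"))
print(check_idvar_values([101, 101], "d1_"))
```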

idvar() is optional. If it is omitted then the following values are used:
- with the varprefix() option, 1, 2, etc., that is, the variable names are varprefix1, varprefix2, etc.;
- with the genmat() option, obs1, obs2, etc. as row and column names;
- with the genfile() option, 1, 2, etc.

But note that these numbers refer to the observations in the present order and become meaningless after a sort. Thus, idvar() provides a more secure way of identifying the results.

The varprefix(), genmat(), and genfile() options are nonexclusive alternatives for obtaining the output; at least one of them must be used.

varprefix(varprefix) specifies that the results will be placed in a set of new variables, one for each observation. These variables will be named with a common prefix varprefix, and the remainder of the names are the values in idvar, or the observation numbers if idvar() is omitted. See the notes regarding acceptable content for idvar, above. Note that this option can generate a potentially very large set of variables - as many as there are observations (thus constituting a square array of values), though that set may be reduced under the treated() option. (See remarks under treated() for more on that matter.) The default type for these variables is double.

genmat(genmat) specifies that the results will be placed in a matrix named genmat. The row and column names will be taken from the values in idvar, or will be obs1, obs2, etc., if idvar() is omitted. See the notes regarding acceptable content for idvar, above. If genmat already exists as a matrix, it will be overwritten. See additional remarks under treated() regarding which rows and columns will be included.

genmat() potentially creates a very large matrix. You may need to set matsize to a large value to enable this matrix to be created.

genfile(filename) specifies that the results will be placed in a separate dataset in long form. See reshape for an explanation of long form.

Under the genfile() option, the resulting file is a Stata dataset with these variables:

A primary and secondary id variable, which refer to observations in the dataset from which the measures were derived.

A variable to hold the distance measure, measured between the observations identified in the primary and secondary id variables. The default name is _score, and its default type is double.

The types, content, and default names of the primary and secondary id variables depend on whether idvar() is specified:

If idvar() is specified, then these variables are of the same type as idvar, and contain values from idvar corresponding to the pertinent observations. Their default names are _refid and idvar.

If idvar() is omitted, then they are integer types, and contain the corresponding observation numbers. Their default names are _refobs and _obs.

Note that each of the three output options has two distinct entities that locate a distance measure value. We will identify one as primary and the other as secondary. The primary entities are...

for varprefix(), the variables generated;
for genmat(), the rows of the matrix;
for genfile(), the primary id variable.

The secondary entities are...

for varprefix(), the observations of the dataset (with values placed in the generated variables);
for genmat(), the columns of the matrix;
for genfile(), the secondary id variable.

Thus, the distance measure represents a difference measured from the observation identified by the primary entity to the observation identified by the secondary entity; the distance in the other direction is the same. Consequently, the distinction between the primary and secondary entities often becomes immaterial, due to the symmetry of the situation. However, there is a situation where we choose to make a distinction, and the resulting set of values is asymmetric. In particular, this occurs under the treated() option, which will be described below.
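The correspondence between the three output layouts can be pictured with a small Python sketch. It is an illustration only, not the program's code: the id values, the scores, and the prefix d1_ are made up, and including the self-pairs (distance 0) and the row ordering shown are assumptions rather than documented behavior. The names _refid, persno (standing in for idvar), and _score follow the defaults described above.

```python
ids = [101, 102, 103]

# Hypothetical distance values, stored once per unordered pair.
score = {(101, 101): 0.0, (101, 102): 1.5, (101, 103): 4.0,
         (102, 102): 0.0, (102, 103): 2.5, (103, 103): 0.0}

def d(a, b):
    """Symmetric lookup: the distance is the same in both directions."""
    return score[(a, b)] if (a, b) in score else score[(b, a)]

# genmat(): rows are the primary entities, columns the secondary ones.
matrix = [[d(r, c) for c in ids] for r in ids]

# genfile(): long form, one record per (primary, secondary) pair.
long_rows = [{"_refid": r, "persno": c, "_score": d(r, c)}
             for r in ids for c in ids]

# varprefix(d1_): one variable per primary entity; each variable holds
# one value per observation (the secondary entity).
variables = {f"d1_{r}": [d(r, c) for c in ids] for r in ids}

# The same value, from 101 to 103, located in each layout.
print(matrix[0][2], variables["d1_101"][2],
      [row["_score"] for row in long_rows
       if row["_refid"] == 101 and row["persno"] == 103])
```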

Options for use with genfile

name1(name1) allows you to specify the name for the primary id variable. The default name depends on whether idvar() is specified, as explained above.

name2(name2) allows you to specify the name for the secondary id variable. The default name depends on whether idvar() is specified, as explained above.

scorevar(scorevar) allows you to specify the name of the distance measure variable. The default name is _score.

replace specifies that if the file already exists, it will be replaced.

More Options

invcovarmat(invcovarmat) specifies the name of a matrix to be used in the computation of the distance measure. It is presumably the inverse covariance matrix of varlist, but the only requirement is that it be a square p-by-p matrix, and both the row and column names must equal the names in varlist in the same order as in varlist.

invcovarmat(invcovarmat) is expected to be rarely used; it is provided in case the user wishes to supply an existing inverse covariance matrix, or one computed in some special way not provided for by the available options. Additionally, it might enable some efficiency advantage if repeated calls are made requiring the same inverse covariance matrix. For most usages, however, you probably want the compute_invcovarmat option.

compute_invcovarmat specifies that you want the inverse covariance matrix to be computed, rather than passed in (via invcovarmat()). This computation is subject to weighting, as well as limitation by treatedvar if the treated() option is specified. (But see nocovtrlimitation.) Note that this will call covariancemat, which computes covariances limited to observations with all variables of varlist nonmissing. (I.e., it is potentially different from the pairwise computation of covariances.)

invcovarmat() and compute_invcovarmat are alternatives; one of them must be specified. If both are specified, then compute_invcovarmat takes precedence.

treated(treatedvar) specifies a numeric variable that distinguishes the "treated" observations, with values of 0 and non-zero signifying non-treated and treated, respectively. See mahapick for an explanation of the concept of the treated set. If treated() is specified, it affects the action of the compute_invcovarmat option, in that the computation is limited to the set of observations for which treatedvar is non-zero. See nocovtrlimitation for how to control that limitation.

treated() also potentially limits the set of values that are output. In generic terms, the default action is that primary entities are associated with (limited to) the treated observations, and the secondary entities are associated with (limited to) the non-treated observations. (One exception: the secondary entities of the varprefix() option - the placing of values in the generated variables - are never limited in this way.) More specifically,

With varprefix(), only the variables corresponding to the treated observations will be generated.

With genmat(), only the rows corresponding to treated cases are generated; only the columns corresponding to non-treated cases are generated.

With genfile(), only observations with primary id corresponding to treated cases, and with secondary id corresponding to non-treated cases, are generated.

The rationale is that, with treated(), you would only be interested in distance measurements from a treated observation to a non-treated one. (And these limitations save space as well.)

The all option lifts these limitations entirely; both the primary and secondary entities will range over all observations, yielding a square symmetric result.

The full option lifts the restriction on the secondary entities; all possible secondary id values or matrix columns are generated. (It has no effect on the varprefix() results, as the secondary entities for varprefix() are never limited by treated().)

In other words,

the variables generated, the rows of the matrix, or the primary id variable correspond to...
- the treated observations, if treated() is specified without all;
- all observations, otherwise.

the columns of the matrix, or the secondary id variable correspond to...
- the non-treated observations, if treated() is specified without all or full;
- all observations, otherwise.

Note that all implies full; there is no provision for generating all primary id values (or matrix rows) without also getting all secondary id values (or matrix columns).
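The selection rules above can be restated as a short Python sketch. It is an illustration of the rules as described for genmat() and genfile(), not the program's code; the function name, argument names, and sample ids are hypothetical, and the handling of missing values in treatedvar is not addressed.

```python
def output_layout(ids, treated=None, all_opt=False, full_opt=False):
    """Return (primary_ids, secondary_ids) for genmat()/genfile() output.

    ids      -- identifying values, one per observation
    treated  -- 0 / non-zero flags, or None if treated() is not specified
    all_opt  -- stands for the all option
    full_opt -- stands for the full option
    """
    # Without treated(), or with all, both entities range over everything.
    if treated is None or all_opt:
        return list(ids), list(ids)
    # Primary entities: treated observations only.
    primary = [i for i, t in zip(ids, treated) if t != 0]
    # Secondary entities: non-treated only, unless full lifts the limit.
    if full_opt:
        secondary = list(ids)
    else:
        secondary = [i for i, t in zip(ids, treated) if t == 0]
    return primary, secondary

ids = [101, 102, 103, 104]
treated = [1, 0, 0, 1]
print(output_layout(ids, treated))
print(output_layout(ids, treated, full_opt=True))
print(output_layout(ids, treated, all_opt=True))
```

For varprefix(), only the first element of each result applies, since its secondary entities (the observations) are never limited by treated().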

unsquared modifies the results to be the unsquared values, that is, the square roots of the default values.

euclidean takes effect only if compute_invcovarmat is also specified. This specifies that the normalized Euclidean measure is to be used, rather than the true Mahalanobis measure - meaning that the off-diagonal elements of the covariance matrix are replaced with zeroes prior to inverting. The result is a measure that accounts for the scale of measurement in each variable of varlist, but ignores correlation between the variables. This is probably not desirable, given the advantages of the true Mahalanobis measure, but is provided as an alternative and for comparison to (or emulation of) earlier releases of mahascore and mahapick. See mahascore for more details on this matter.
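The effect of zeroing the off-diagonal elements can be sketched in Python. This is an illustration of the idea, not the program's code; the covariance matrix and points are made up. Once the off-diagonal elements are zero, the inverse is simply the diagonal matrix of reciprocal variances, so each squared difference is divided by that variable's variance.

```python
def normalized_euclidean_inverse(cov):
    """Inverse of cov after replacing its off-diagonal elements with zero."""
    p = len(cov)
    return [[1.0 / cov[i][i] if i == j else 0.0 for j in range(p)]
            for i in range(p)]

def dist_sq(x, y, inv):
    """Squared measure (x-y)' * inv * (x-y)."""
    d = [a - b for a, b in zip(x, y)]
    p = len(d)
    return sum(d[i] * inv[i][j] * d[j] for i in range(p) for j in range(p))

# Hypothetical covariance matrix: variances 4.0 and 0.25, covariance 0.5.
cov = [[4.0, 0.5],
       [0.5, 0.25]]

inv_diag = normalized_euclidean_inverse(cov)

# Each coordinate difference is one standard deviation (2.0 and 0.5),
# so each contributes 1.0 to the squared measure.
result = dist_sq((2.0, 0.5), (0.0, 0.0), inv_diag)
print(inv_diag, result)
```

The covariance of 0.5 plays no part in the result, which is the sense in which this measure ignores correlation between the variables.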

float specifies that the type for the variables generated by varprefix() or for the distance measure (or scorevar) generated by genfile() will be float, rather than double. This has no effect on genmat(), as matrices always contain doubles.

display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar, then the covariance matrix is listed; if it contains invcov, then the inverse covariance matrix is listed. Any other content is ignored.

verbose takes effect only if compute_invcovarmat is also specified. This causes each call to mahascore to be reported, along with information about what options were specified.

transpose specifies that the matrix (under genmat()) is to be transposed.

nocovtrlimitation specifies that the covariance computation (for compute_invcovarmat) not be limited to treated observations.

Remarks

If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures. (It may also cause the unsquared option to have a real effect on comparisons and sortings of the results.)

Please see mahascore for more information on the computation of the Mahalanobis measure.

Examples

. mahascores income age numkids edlevel, idvar(persno) varprefix(d1_) treated(assisted) compute_invcov

. mahascores income age numkids edlevel, idvar(persno) genmat(m1) treated(assisted) compute_invcov

. mahascores income age numkids edlevel, idvar(persno) genfile(dist1) compute_invcov scorevar(d1)

Acknowledgement

The author wishes to thank Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung GmbH for suggestions leading to the development of this program.

Author

David Kantor. Email kantor.d@att.net if you observe any problems.

Also See

mahascore, mahascore2, mahapick, covariancemat, variancemat