------------------------------------------------------------------------------- help for mahascores -------------------------------------------------------------------------------
Generate a set of Mahalanobis distance measures
mahascores varlist [weight] , [ idvar(idvar) varprefix(varprefix) genmat(genmat) genfile(filename) name1(name1) name2(name2) scorevar(scorevar) replace treated(treatedvar) invcovarmat(invcovarmat) compute_invcovarmat display(display_options) full all unsquared euclidean verbose float transpose nocovtrlimitation ]
Description
mahascores generates a Mahalanobis distance measure between every pair of observations, or possibly between selected pairs of observations (under the treated() option). By default, the result is actually the square of the proper Mahalanobis measure. You can use the unsquared option to give you the unsquared value, but note that in most cases, the resulting values are used in comparisons or sortings; the proportional magnitude is not significant, so the squared values are just as good.
varlist (the "covariates") is a list of numeric variables on which to build the distance measure. These variables should be of numeric significance, not categorical; any categorical variables should be replaced by a set of indicator variables.
See mahascore for an explanation of the Mahalanobis measure.
Weights are allowed, but apply only under the compute_invcovarmat option.
There are three means of getting the output:
- a set of generated variables, using the varprefix() option;
- a matrix, using the genmat() option;
- a separate file, using the genfile() option.
Options
In what follows, let p denote the number of variables in varlist.
idvar(idvar) is an identifying variable which is used to mark the components of the output:
- with the varprefix() option, its values become part of the new variable names;
- with the genmat() option, its values are used as matrix row and column names;
- with the genfile() option, its values go into the primary and secondary identifying variables.
idvar can be of any type, but it must be a single variable. If the existing identifying scheme consists of multiple variables, you should devise a way to combine them uniquely into a single variable. Numbers are acceptable, but they should be integers.
It is desirable and often essential (depending on the output options) that idvar uniquely identify observations. Under the varprefix() option, it is essential that the contents of idvar be acceptable as suffixes on variable names; avoid embedded spaces and characters that are not acceptable in variable names.
--------------------------------------------------------------------
technical notes:
With the varprefix() option, illegal characters, embedded spaces, or non-unique values may result in a fatal error.
With the genmat() option, embedded spaces may cause the row or column names to "slip over" to the wrong row or column. Non-unique values do not cause an immediate error, but may cause confusing labelling of the columns and rows and may cause errors in later use of the matrix.
With the genfile() option, the form of the values in idvar is not critical, but if they don't uniquely identify observations, then it will be difficult to use the resulting file.
--------------------------------------------------------------------
idvar() is optional. If it is omitted then the following values are used:
- with the varprefix() option, 1, 2, etc., that is, the variable names are varprefix1, varprefix2, etc.;
- with the genmat() option, obs1, obs2, etc. as row and column names;
- with the genfile() option, 1, 2, etc.
But note that these numbers refer to the observations in the present order and become meaningless after a sort. Thus, idvar() provides a more secure way of identifying the results.
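As an illustration of combining a multi-variable identifying scheme into a single idvar, the following is a minimal sketch for a do-file. It assumes a hypothetical dataset in which observations are identified by hhid and persno together; the names hhid, uid, and dist_uid are invented for this example, and the covariates are those of the Examples section.

```stata
* Hypothetical: observations are identified jointly by hhid and persno.
* group() assigns one integer to each distinct combination.
egen long uid = group(hhid persno)

* Stop with an error unless uid uniquely identifies the observations.
isid uid

* Integer values are acceptable as idvar under all three output options.
mahascores income age numkids edlevel, idvar(uid) genfile(dist_uid) compute_invcovarmat
```

Running isid before mahascores surfaces duplicate identifiers early, rather than as a fatal error or confusing labelling later on.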
The varprefix(), genmat(), and genfile() options are nonexclusive alternatives for obtaining the output; at least one of them must be used.
varprefix(varprefix) specifies that the results will be placed in a set of new variables, one for each observation. These variables will be named with a common prefix varprefix, and the remainder of the names are the values in idvar, or the observation numbers if idvar() is omitted. See the notes regarding acceptable content for idvar, above. Note that this option can generate a potentially very large set of variables - as many as there are observations (thus, constituting a square array of values), though that set may be reduced under the treated() option. (See remarks under treated() for more on that matter.) The default type for these variables is double.
genmat(genmat) specifies that the results will be placed in a matrix named genmat. The row and column names will be taken from the values in idvar, or will be obs1, obs2, etc., if idvar() is omitted. See the notes regarding acceptable content for idvar, above. If genmat already exists as a matrix, it will be overwritten. See additional remarks under treated() regarding which rows and columns will be included.
genmat() potentially creates a very large matrix. You may need to set matsize to a large value to enable this matrix to be created.
genfile(filename) specifies that the results will be placed in a separate dataset in long form. See reshape for an explanation of long form.
Under the genfile() option, the resulting file is a Stata dataset with these variables:
A primary and secondary id variable which refer to observations in the dataset from which the measures were derived.
A variable to hold the distance measure, measured between the observations identified in the primary and secondary id variables. The default name is _score, and its default type is double.
The types, content, and default names of the primary and secondary id variables depend on whether idvar() is specified:
If idvar() is specified, then these variables are of the same type as idvar, and contain values from idvar corresponding to the pertinent observations. Their default names are _refid and idvar.
If idvar() is omitted, then they are integer types, and contain the corresponding observation numbers. Their default names are _refobs and _obs.
Note that each of the three output options has two distinct entities that locate a distance measure value. We will identify one as primary and the other as secondary. The primary entities are...
- for varprefix(), the variables generated;
- for genmat(), the rows of the matrix;
- for genfile(), the primary id variable.
The secondary entities are...
- for varprefix(), the observations of the dataset (with values placed in the generated variables);
- for genmat(), the columns of the matrix;
- for genfile(), the secondary id variable.
Thus, the distance measure represents a difference measured from the observation identified by the primary entity to the observation identified by the secondary entity; the distance in the other direction is the same. Consequently, the distinction between the primary and secondary entities often becomes immaterial, due to the symmetry of the situation. However, there is a situation where we choose to make a distinction, and the resulting set of values is asymmetric. In particular, this occurs under the treated() option, which will be described below.
Options for use with genfile()
name1(name1) allows you to specify the name for the primary id variable. The default name depends on whether idvar() is specified, as explained above.
name2(name2) allows you to specify the name for the secondary id variable. The default name depends on whether idvar() is specified, as explained above.
scorevar(scorevar) allows you to specify the name of the distance measure variable. The default name is _score.
replace specifies that if the file already exists, it will be replaced.
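The genfile() options can be combined as in the following sketch, written for a do-file. It reuses the variables of the Examples section; the names treated_id, control_id, and d are chosen here for illustration only. Because treated() is specified without all or full, each record of the file pairs a treated observation (primary id) with a non-treated one (secondary id), so keeping the smallest score per primary id gives one possible way to pick a closest non-treated observation.

```stata
* Write the distances to dist1 in long form, naming all three variables explicitly.
mahascores income age numkids edlevel, idvar(persno) treated(assisted) ///
    genfile(dist1) name1(treated_id) name2(control_id) scorevar(d)     ///
    replace compute_invcovarmat

* Inspect the result without disturbing the data in memory.
preserve
use dist1, clear

* Within each treated observation, order by distance and keep the smallest.
bysort treated_id (d): keep if _n == 1
list treated_id control_id d
restore
```

Naming the id variables explicitly avoids depending on the defaults, which differ according to whether idvar() is specified. Ties in d are broken arbitrarily in this sketch.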
More Options
invcovarmat(invcovarmat) specifies the name of a matrix to be used in the computation of the distance measure. It is presumably the inverse covariance matrix of varlist, but the only requirements are that it be a square p-by-p matrix and that both its row and column names equal the names in varlist, in the same order as in varlist.
invcovarmat(invcovarmat) is expected to be rarely used; it is provided in case the user wishes to supply an existing inverse covariance matrix, or one computed in some special way not provided for by the available options. Additionally, it might enable some efficiency advantage if repeated calls are made requiring the same inverse covariance matrix. For most usages, however, you probably want the compute_invcovarmat option.
compute_invcovarmat specifies that you want the inverse covariance matrix to be computed, rather than passed in (via invcovarmat()). This computation is subject to weighting, as well as limitation by treatedvar if the treated() option is specified. (But see nocovtrlimitation.) Note that this will call covariancemat, which computes covariances limited to observations with all variables of varlist nonmissing. (I.e., it is potentially different from the pairwise computation of covariances.)
invcovarmat() and compute_invcovarmat are alternatives; one of them must be specified. If both are specified, then compute_invcovarmat takes precedence.
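For the rarer case of supplying your own matrix through invcovarmat(), the sketch below builds one with built-in Stata commands. The matrix names C and Cinv are arbitrary, and the variables are those of the Examples section. It is not claimed that this reproduces exactly what compute_invcovarmat does; in particular, it applies no weights and no treated() limitation.

```stata
* Covariance matrix over observations with all four covariates nonmissing.
* r(C) carries the variable names as row and column names, in varlist order.
quietly correlate income age numkids edlevel, covariance
matrix C = r(C)

* Invert the symmetric matrix; the row and column names are carried over.
matrix Cinv = invsym(C)
matrix list Cinv

* Pass the matrix in; the covariates must be listed in the same order.
mahascores income age numkids edlevel, idvar(persno) genmat(m1) invcovarmat(Cinv)
```

Listing the matrix before the call is a quick check that its row and column names match varlist, which is the stated requirement for invcovarmat().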
treated(treatedvar) specifies a numeric variable that distinguishes the "treated" observations, with values of 0 and non-zero signifying non-treated and treated, respectively. See mahapick for an explanation of the concept of the treated set. This option affects the action of the compute_invcovarmat option in that the computation is limited to the set of observations for which treatedvar is non-zero, if treated() is specified. See nocovtrlimitation for how to control that limitation.
treated() also potentially limits the set of values that are output. In generic terms, the default action is that primary entities are associated with (limited to) the treated observations, and the secondary entities are associated with (limited to) the non-treated observations. (One exception: the secondary entities of the varprefix() option - the placing of values in the generated variables - are never limited in this way.) More specifically,
With varprefix(), only the variables corresponding to the treated observations will be generated.
With genmat(), only the rows corresponding to treated cases are generated; only the columns corresponding to non-treated cases are generated.
With genfile(), only observations with primary id corresponding to treated cases, and with secondary id corresponding to non-treated cases, are generated.
The rationale is that, with treated(), you would only be interested in distance measurements from a treated observation to a non-treated one. (And these limitations save space as well.)
The all option lifts these limitations entirely; both the primary and secondary entities will range over all observations, yielding a square symmetric result.
The full option lifts the restriction on the secondary entities; all possible secondary id values or matrix columns are generated. (It has no effect on the varprefix() results, as the secondary entities for varprefix() are never limited by treated().)
In other words, the variables generated, the rows of the matrix, or the primary id variable correspond to...
- the treated observations, if treated() is specified without all;
- all observations, otherwise.
The columns of the matrix, or the secondary id variable, correspond to...
- the non-treated observations, if treated() is specified without all or full;
- all observations, otherwise.
Note that all implies full; there is no provision for generating all primary id values (or matrix rows) without also getting all secondary id values (or matrix columns).
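The effect of full and all on the shape of the output can be checked with a sketch like the following, written for a do-file. It uses genmat() and the variables of the Examples section; the matrix names m_default, m_full, and m_all are invented here, and the dataset is assumed small enough for the matrices to be created.

```stata
* Default under treated(): rows = treated, columns = non-treated.
mahascores income age numkids edlevel, idvar(persno) treated(assisted) ///
    genmat(m_default) compute_invcovarmat

* full: rows = treated, columns = all observations.
mahascores income age numkids edlevel, idvar(persno) treated(assisted) ///
    genmat(m_full) full compute_invcovarmat

* all: rows and columns both range over all observations (square matrix).
mahascores income age numkids edlevel, idvar(persno) treated(assisted) ///
    genmat(m_all) all compute_invcovarmat

* Compare the dimensions (rows x columns) of the three results.
display "default: " rowsof(m_default) " x " colsof(m_default)
display "full:    " rowsof(m_full)    " x " colsof(m_full)
display "all:     " rowsof(m_all)     " x " colsof(m_all)
```

In all three calls the covariance computation is still limited to the treated observations, since nocovtrlimitation is not specified; only the set of values that are output changes.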
unsquared modifies the results to be the unsquared values, that is, the square roots of the default values.
euclidean takes effect only if compute_invcovarmat is also specified. This specifies that the normalized Euclidean measure is to be used, rather than the true Mahalanobis measure - meaning that the off-diagonal elements of the covariance matrix are replaced with zeroes prior to inverting. The result is a measure that accounts for the scale of measurement in each variable of varlist, but ignores correlation between the variables. This is probably not desirable, given the advantages of the true Mahalanobis measure, but is provided as an alternative and for comparison to (or emulation of) earlier releases of mahascore and mahapick. See mahascore for more details on this matter.
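The difference between the two measures can be written out as follows. The notation is introduced here for illustration and is not taken from the program: x_i and x_j denote the vectors of the p covariates for observations i and j, and S denotes the covariance matrix of varlist, with diagonal elements s_kk. Both expressions give the default (squared) values; unsquared takes their square roots.

```latex
% Default: squared Mahalanobis measure
\[
  d^2_{ij} = (x_i - x_j)^{\top} S^{-1} (x_i - x_j)
\]

% With euclidean: the off-diagonal elements of S are set to zero before
% inverting, so the inverse is diagonal with entries 1 / s_{kk}
\[
  d^2_{ij} = \sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{s_{kk}}
\]
```

The second form rescales each covariate by its variance but has no cross terms, which is why it ignores correlation between the variables.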
float specifies that the type for the variables generated by varprefix() or for the distance measure (or scorevar) generated by genfile() will be float, rather than double. This has no effect on genmat(), as matrices always contain doubles.
display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar, then the covariance matrix is listed; if it contains invcov, then the inverse covariance matrix is listed. Any other content is ignored.
verbose takes effect only if compute_invcovarmat is also specified. This causes each call to mahascore to be reported, along with information about what options were specified.
transpose specifies that the matrix (under genmat()) is to be transposed.
nocovtrlimitation specifies that the covariance computation (for compute_invcovarmat) not be limited to treated observations.
Remarks
If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures. (It may also cause the unsquared option to have a real effect on comparisons and sortings of the results.)
Please see mahascore for more information on the computation of the Mahalanobis measure.
Examples
. mahascores income age numkids edlevel, idvar(persno) varprefix(d1_) treated(assisted) compute_invcov
. mahascores income age numkids edlevel, idvar(persno) genmat(m1) treated(assisted) compute_invcov
. mahascores income age numkids edlevel, idvar(persno) genfile(dist1) compute_invcov scorevar(d1)
Acknowledgement
The author wishes to thank Heiko Giebler of Wissenschaftszentrum Berlin für Sozialforschung GmbH, for a suggestion leading to the development of this program.
Author
David Kantor. Email kantor.d@att.net if you observe any problems.
Also See
mahascore, mahascore2, mahapick, covariancemat, variancemat