{smcl}
{* 22mar2008, rev 1apr2008, 2012feb17; 2012nov12 (one erroneous char deleted)}
{hline}
help for {hi:mahascores}
{hline}
{title:Generate a set of Mahalanobis distance measures}
{p 8 17 2}
{cmd:mahascores}
{it:varlist} [{it:weight}] {cmd:,}
[
{cmd:idvar(}{it:idvar}{cmd:)}
{cmd:varprefix(}{it:varprefix}{cmd:)}
{cmd:genmat(}{it:genmat}{cmd:)}
{cmd:genfile(}{it:filename}{cmd:)}
{cmd:name1(}{it:name1}{cmd:)}
{cmd:name2(}{it:name2}{cmd:)}
{cmd:scorevar(}{it:scorevar}{cmd:)}
{cmd:replace}
{cmd:treated(}{it:treatedvar}{cmd:)}
{cmdab:invcov:armat(}{it:invcovarmat}{cmd:)}
{cmdab:compute:_invcovarmat}
{cmdab:disp:lay(}{it:display_options}{cmd:)}
{cmd:full} {cmd:all}
{cmdab:unsq:uared} {cmdab:eucl:idean} {cmdab:verb:ose} {cmd:float}
{cmdab:trans:pose}
{cmdab:nocovtrlim:itation}
]
{title:Description}
{p 4 4 2}
{cmd:mahascores} generates a Mahalanobis distance measure
between every pair of observations, or possibly between selected pairs of
observations (under the {cmd:treated()} option).
By default, the result is actually the square of the proper Mahalanobis
measure. You can use the {cmd:unsquared} option to obtain the
unsquared value, but note that in most cases the resulting values are used
only in comparisons or sortings; only their ordering matters, not their
magnitude, so the squared values are just as good.
{p 4 4 2}
{it:varlist} (the "covariates") is a list of numeric variables on which to build
the distance measure.
These variables should be of numeric significance, not categorical; any
categorical variables should be replaced by a set of indicator variables.
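{p 4 4 2}
As a minimal sketch of that replacement, suppose a categorical variable named
{cmd:region} (a hypothetical name, used only for illustration) takes three
values. Stata's {help tabulate} with its {cmd:generate()} option creates one
indicator per category. One indicator is left out of {it:varlist} here, on the
assumption that a complete set would be exactly collinear and would make the
covariance matrix singular:{p_end}
{p 8 12 2}
{cmd:. tabulate region, generate(reg_)}{p_end}
{p 8 12 2}
{cmd:. mahascores income age reg_2 reg_3, idvar(persno) genmat(m1) compute_invcovarmat}{p_end}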
{p 4 4 2}
See {help mahascore} for an explanation of the Mahalanobis measure.
{p 4 4 2}
Weights are allowed, but apply only under the {cmd:compute_invcovarmat} option.
{p 4 4 2}
There are three means of getting the output:{p_end}
{p 8 10 2}{c -} a set of generated variables, using the {cmd:varprefix()} option;{p_end}
{p 8 10 2}{c -} a matrix, using the {cmd:genmat()} option;{p_end}
{p 8 10 2}{c -} a separate file, using the {cmd:genfile()} option.
{title:Options}
{p 4 4 2}
In what follows, let {it:p} denote the number of variables in {it:varlist}.
{p 4 4 2}
{cmd:idvar(}{it:idvar}{cmd:)} is an identifying variable which is used to mark
the components of the output:{p_end}
{p 8 10 2}{c -} with the {cmd:varprefix()} option, its values become part
of the new variable names;{p_end}
{p 8 10 2}{c -} with the {cmd:genmat()} option, its values are used as
matrix row and column names;{p_end}
{p 8 10 2}{c -} with the {cmd:genfile()} option, its values go into the
primary and secondary identifying variables.
{p 4 4 2}
{it:idvar} can be of any type, but it must
be a single variable. If the existing identifying scheme consists of
multiple variables, you should devise a way to combine them uniquely into a
single variable. Numbers are acceptable, but they should be integers.
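{p 4 4 2}
One possible way to build such a variable is the {cmd:group()} function of
{help egen}, which assigns a distinct integer to each combination of its
arguments. In this sketch, {cmd:hhid} and {cmd:persid} are hypothetical names;
note that {cmd:group()} leaves the result missing where any argument is
missing:{p_end}
{p 8 12 2}
{cmd:. egen long persid = group(hhid persno)}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persid) genfile(dist1) compute_invcovarmat}{p_end}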
{p 4 4 2}
It is desirable and often essential (depending on the output
options) that {it:idvar} uniquely identify observations.
Under the {cmd:varprefix()} option,
it is essential that the contents of {it:idvar} be acceptable as suffixes on
variable names; avoid embedded spaces and characters that are not acceptable
in variable names.
{col 12}{hline}
{p 12 12 12}
{hi:technical notes:}
{p 12 12 12}
With the {cmd:varprefix()} option, illegal characters, embedded spaces,
or non-unique values may result in a fatal error.
{p 12 12 12}
With the {cmd:genmat()} option, embedded spaces may cause the row or column
names to "slip over" to the wrong row or column. Non-unique values do not
cause an immediate error, but may cause confusing labelling of the columns
and rows and may cause errors in later use of the matrix.
{p 12 12 12}
With the {cmd:genfile()} option, the form of the values in {it:idvar} is not
critical, but if they don't uniquely identify observations, then it will be
difficult to use the resulting file.
{p_end}
{col 12}{hline}
{p 4 4 2}
{cmd:idvar()} is optional. If it is omitted then the following
values are used:{p_end}
{p 8 10 2}{c -} with the {cmd:varprefix()} option, 1, 2, etc., that is, the
variable names are {it:varprefix}1, {it:varprefix}2, etc.;{p_end}
{p 8 10 2}{c -} with the {cmd:genmat()} option, obs1, obs2, etc. as row and
column names;{p_end}
{p 8 10 2}{c -} with the {cmd:genfile()} option, 1, 2, etc.{p_end}
{p 4 4 2}But note that these numbers refer to the observations in the present
order and become meaningless after a {help sort}. Thus, {cmd:idvar()}
provides a more secure way of identifying the results.
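{p 4 4 2}
If the data have no natural identifier, one simple approach (a sketch, with
{cmd:obsid} as a hypothetical variable name) is to record the current
observation number in a variable before running the command. The stored values
then stay with their observations through any later sort:{p_end}
{p 8 12 2}
{cmd:. generate long obsid = _n}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(obsid) genfile(dist1) compute_invcovarmat}{p_end}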
{p 4 4 2}
The {cmd:varprefix()}, {cmd:genmat()}, and {cmd:genfile()} options are
nonexclusive alternatives for obtaining the output; at least one of them
must be used.
{p 4 4 2}
{cmd:varprefix(}{it:varprefix}{cmd:)} specifies that the results will be
placed in a set of new variables, one for each observation.
These variables will be named with a common prefix {it:varprefix}, and
the remainder of the names are the values in {it:idvar}, or the observation
numbers if {cmd:idvar()} is omitted. See the notes regarding acceptable content
for {it:idvar}, above.
Note that this option can generate a potentially very large set of
variables {c -} as many as there are observations (thus constituting a square
array of values), though that set may be reduced under the {cmd:treated()} option.
(See the remarks under {cmd:treated()} for more on that matter.)
The default type for these variables is double.
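{p 4 4 2}
As an illustration, the following sketch generates one variable per treated
observation and then inspects the result. The name {cmd:d1_101} assumes,
purely for the sake of example, that a treated observation has
{cmd:persno} equal to 101:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) varprefix(d1_) treated(assisted) compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. describe d1_*}{p_end}
{p 8 12 2}
{cmd:. summarize d1_101}{p_end}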
{p 4 4 2}
{cmd:genmat(}{it:genmat}{cmd:)} specifies that the results will be placed in
a matrix named {it:genmat}. The row and column names will be taken from the
values in {it:idvar}, or will be obs1, obs2, etc., if {cmd:idvar()} is omitted.
See the notes regarding acceptable content for {it:idvar}, above.
If {it:genmat} already exists as a matrix, it will be
overwritten. See additional remarks under {cmd:treated()} regarding which
rows and columns will be included.
{p 4 4 2}
{cmd:genmat()} potentially creates a very large matrix. You may need to
{help set matsize} to a large value to enable this matrix to be created.
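{p 4 4 2}
For example, the setting might be raised before the call as sketched below.
The value 800 is illustrative only; the permitted maximum depends on the Stata
flavor and release, and in more recent releases the setting may no longer be
needed:{p_end}
{p 8 12 2}
{cmd:. set matsize 800}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(m1) compute_invcovarmat}{p_end}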
{p 4 4 2}
{cmd:genfile(}{it:filename}{cmd:)} specifies that the results will be placed in
a separate dataset in long form. See {help reshape} for an explanation of
long form.
{p 6 6 2}
Under the {cmd:genfile()} option, the resulting file is a Stata
dataset with these variables:
{p 10 12 2}
A primary and secondary id variable which refer to observations in the dataset
from which the measures were derived.
{p 10 12 2}
A variable to hold the distance measure, measured between the observations
identified in the primary and secondary id variables. The default name is
_score, and its default type is double.
{p 6 6 2}
The types, content, and default names of the primary and secondary id variables depend on
whether {cmd:idvar()} is specified:
{p 10 10 2}
If {cmd:idvar()} is specified, then these variables
are of the same type as {it:idvar}, and contain values from {it:idvar}
corresponding to the pertinent observations.
Their default names are _refid and {it:idvar}.
{p 10 10 2}
If {cmd:idvar()} is omitted, then
they are integer types, and contain the corresponding observation numbers.
Their default names are _refobs and _obs.
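{p 6 6 2}
A sketch of how the resulting file might be examined follows. With
{cmd:idvar(persno)} the id variables should carry the default names _refid
and persno, as described above. Sorting by _refid and _score places, for each
primary id, the closest secondary observations first. Be aware that
{cmd:use} with {cmd:clear} discards the data currently in memory:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genfile(dist1) treated(assisted) compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. use dist1, clear}{p_end}
{p 8 12 2}
{cmd:. describe}{p_end}
{p 8 12 2}
{cmd:. sort _refid _score}{p_end}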
{p 4 4 2}
Note that each of the three output options has two distinct entities that locate a
distance measure value. We will identify one as primary and the other as
secondary. The primary entities are...
{p 6 6 2}
for {cmd:varprefix()}, the variables generated;{p_end}
{p 6 6 2}
for {cmd:genmat()}, the rows of the matrix;{p_end}
{p 6 6 2}
for {cmd:genfile()}, the primary id variable.
{p 4 4 2}
The secondary entities are...
{p 6 6 2}
for {cmd:varprefix()}, the observations of the dataset (with values placed
in the generated variables);{p_end}
{p 6 6 2}
for {cmd:genmat()}, the columns of the matrix;{p_end}
{p 6 6 2}
for {cmd:genfile()}, the secondary id variable.
{p 4 4 2}
Thus, the distance measure represents a difference measured from the
observation identified by the primary entity to the observation identified by
the secondary entity; the distance in the other direction is the same.
Consequently, the distinction between the primary and secondary entities
often becomes immaterial, due to the symmetry of the situation.
However, there is a situation where we choose to make a distinction, and the
resulting set of values is asymmetric. In particular,
this occurs under the {cmd:treated()} option, which will be described below.
{title:Options for use with {cmd:genfile()}}
{p 4 4 2}
{cmd:name1(}{it:name1}{cmd:)} allows you to specify the name for
the primary id variable. The default name depends on whether {cmd:idvar()}
is specified, as explained above.
{p 4 4 2}
{cmd:name2(}{it:name2}{cmd:)} allows you to specify the name for
the secondary id variable. The default name depends on whether {cmd:idvar()}
is specified, as explained above.
{p 4 4 2}
{cmd:scorevar(}{it:scorevar}{cmd:)} allows you to specify the name of
the distance measure variable. The default name is _score.
{p 4 4 2}
{cmd:replace} specifies that if the file already exists, it will be replaced.
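{p 4 4 2}
The following sketch combines these options. The names {cmd:from_id},
{cmd:to_id}, and {cmd:dist} are arbitrary choices made for illustration:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genfile(dist1) name1(from_id) name2(to_id) scorevar(dist) replace compute_invcovarmat}{p_end}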
{title:More Options}
{p 4 4 2}
{cmd:invcovarmat(}{it:invcovarmat}{cmd:)} specifies the name of a matrix
to be used in the computation of the distance measure. It is presumably the
inverse covariance matrix of {it:varlist}, but the only requirement is that
it be a square {it:p}-by-{it:p} matrix, and both the row and column names
must equal the names in {it:varlist} in the same order as in {it:varlist}.
{p 4 4 2}
{cmd:invcovarmat(}{it:invcovarmat}{cmd:)} is expected to be rarely used; it
is provided in case
the user wishes to supply an existing inverse covariance matrix, or one
computed in some special way not provided for by the available options.
Additionally, it might enable some efficiency advantage if repeated calls are
made requiring the same inverse covariance matrix. For most usages, however,
you probably want the {cmd:compute_invcovarmat} option.
{p 4 4 2}
{cmd:compute_invcovarmat} specifies that you want the inverse covariance
matrix to be computed, rather than passed in (via {cmd:invcovarmat()}).
This computation is subject to weighting, as well as limitation by
{it:treatedvar} if the {cmd:treated()} option is specified.
(But see {cmd:nocovtrlimitation}.)
Note that this will call {help covariancemat}, which computes covariances
limited to observations with all variables of {it:varlist} nonmissing.
(I.e., it is potentially different from the pairwise computation of covariances.)
{p 4 4 2}
{cmd:invcovarmat()} and {cmd:compute_invcovarmat} are alternatives; one of
them must be specified. If both are specified, then {cmd:compute_invcovarmat}
takes precedence.
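{p 4 4 2}
As a hedged sketch of supplying your own matrix, the commands below build an
inverse covariance matrix with official Stata tools and pass it in.
{help correlate} with the {cmd:covariance} option leaves the covariance matrix
in r(C), with row and column names equal to the variable names in the order
given; the matrix names {cmd:V} and {cmd:Vinv} are arbitrary. Whether this
reproduces exactly what {cmd:compute_invcovarmat} would compute has not been
verified here:{p_end}
{p 8 12 2}
{cmd:. quietly correlate income age numkids edlevel, covariance}{p_end}
{p 8 12 2}
{cmd:. matrix V = r(C)}{p_end}
{p 8 12 2}
{cmd:. matrix Vinv = syminv(V)}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(m1) invcovarmat(Vinv)}{p_end}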
{p 4 4 2}
{cmd:treated(}{it:treatedvar}{cmd:)} specifies a numeric variable that
distinguishes the "treated" observations, with values of 0 and non-zero
signifying non-treated and treated, respectively. See {help mahapick} for an
explanation of the concept of the treated set.
This option affects the action of the {cmd:compute_invcovarmat} option:
when {cmd:treated()} is specified, the covariance computation is limited
to the set of observations for which {it:treatedvar} is non-zero.
See {cmd:nocovtrlimitation} for how to control that limitation.
{p 4 4 2}
{cmd:treated()} also potentially limits the set of values that are output.
In generic terms, the default action is that primary entities are associated
with (limited to) the treated observations,
and the secondary entities are associated with (limited to) the non-treated
observations. (One exception: the secondary entities of the {cmd:varprefix()}
option {c -} the placing of values in the generated variables {c -} are never
limited in this way.) More specifically,
{p 6 6 2}
With {cmd:varprefix()}, only the variables corresponding to the treated observations
will be generated.
{p 6 6 2}
With {cmd:genmat()}, only the rows corresponding to treated cases are generated;
only the columns corresponding to non-treated cases are generated.
{p 6 6 2}
With {cmd:genfile()}, only observations with a primary id corresponding to
a treated case and a secondary id corresponding to a non-treated case are
generated.
{p 4 4 2}
The rationale is that, with {cmd:treated()}, you would only be interested in
distance measurements from a treated observation to a non-treated one. (And these
limitations save space as well.)
{p 4 4 2}
The {cmd:all} option lifts these limitations entirely; both the primary and
secondary entities will range over all observations, yielding a
square symmetric result.
{p 4 4 2}
The {cmd:full} option lifts the restriction on the secondary entities; all
possible secondary id values or matrix columns are generated.
(It has no effect on the {cmd:varprefix()} results, as the secondary entities
for {cmd:varprefix()} are never limited by {cmd:treated()}.)
{p 4 4 2}
In other words,{p_end}
{p 10 10 2}
the variables generated, the rows of the matrix, or the primary id variable,
correspond to...{p_end}
{p 14 14 2}
the treated observations, if {cmd:treated()} is specified without {cmd:all};{p_end}
{p 14 14 2} all observations, otherwise.{p_end}
{p 10 10 2}
the columns of the matrix, or the secondary id variable, correspond to...{p_end}
{p 14 14 2}
the non-treated observations, if {cmd:treated()} is specified without {cmd:all}
or {cmd:full};{p_end}
{p 14 14 2} all observations, otherwise.{p_end}
{p 4 4 2}
Note that {cmd:all} implies {cmd:full};
there is no provision for generating all primary id values (or matrix rows)
without also getting all secondary id values (or matrix columns).
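{p 4 4 2}
To see the effect of these options, the sketch below builds three matrices
(the names {cmd:mT}, {cmd:mF}, and {cmd:mA} are arbitrary) and displays their
dimensions. Going by the description above, one would expect {cmd:mT} to be
treated by non-treated, {cmd:mF} to be treated by all, and {cmd:mA} to be all
by all:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(mT) treated(assisted) compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(mF) treated(assisted) full compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(mA) treated(assisted) all compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. display rowsof(mT) " x " colsof(mT)}{p_end}
{p 8 12 2}
{cmd:. display rowsof(mF) " x " colsof(mF)}{p_end}
{p 8 12 2}
{cmd:. display rowsof(mA) " x " colsof(mA)}{p_end}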
{p 4 4 2}
{cmd:unsquared} modifies the results to be the unsquared values, that is, the
square roots of the default values.
{p 4 4 2}
{cmd:euclidean} takes effect only if {cmd:compute_invcovarmat} is also specified.
This specifies that the normalized Euclidean measure is to be used, rather
than the true Mahalanobis measure {c -} meaning that the off-diagonal elements
of the covariance matrix are replaced with zeroes prior to inverting. The result
is a measure that accounts for the scale of measurement in each variable of
{it:varlist}, but ignores correlation between the variables. This is probably
not desirable, given the advantages of the true Mahalanobis measure, but is
provided as an alternative and for comparison to (or emulation of) earlier
releases of {help mahascore} and {help mahapick}. See {help mahascore} for
more details on this matter.
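{p 4 4 2}
For comparison, both measures can be generated side by side, as in this
sketch (the matrix names {cmd:mM} and {cmd:mE} are arbitrary). If the
covariates are correlated, the two matrices would generally be expected to
differ:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(mM) compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(mE) compute_invcovarmat euclidean}{p_end}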
{p 4 4 2}
{cmd:float} specifies that the type for the variables generated by
{cmd:varprefix()} or for the distance measure (or {it:scorevar}) generated
by {cmd:genfile()} will be float, rather than double. This has no effect on
{cmd:genmat()}, as matrices always contain doubles.
{p 4 4 2}
{cmd:display(}{it:display_options}{cmd:)} turns on the display of certain
data structures used in the computation. If {it:display_options} contains
{cmd:covar}, then the covariance matrix is listed;
if it contains {cmd:invcov}, then the inverse covariance matrix is listed.
Any other content is ignored.
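{p 4 4 2}
For instance, assuming that both keywords may be given together, separated by
a space, the following sketch would list both matrices:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(m1) compute_invcovarmat display(covar invcov)}{p_end}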
{p 4 4 2}
{cmd:verbose} takes effect only if {cmd:compute_invcovarmat} is also specified.
This causes each call to {help mahascore} to be reported, along with
information about what options were specified.
{p 4 4 2}
{cmd:transpose} specifies that the matrix (under {cmd:genmat()}) is to be
transposed.
{p 4 4 2}
{cmd:nocovtrlimitation} specifies that the covariance computation
(for {cmd:compute_invcovarmat}) not be limited to treated observations.
{title:Remarks}
{p 4 4 2}
If the inverse covariance matrix is computed on a very small set of
observations, it may not be valid and may yield strange results. It
might fail to be positive semi-definite, and can yield negative measures.
(It may also cause the {cmd:unsquared} option to have a real effect on
comparisons and sortings of the results.)
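{p 4 4 2}
One way to check for this condition, sketched here under the assumption that
the default variable name _score is in use, is to write the results to a file
and count the negative values. Note that {cmd:use} with {cmd:clear} discards
the data currently in memory:{p_end}
{p 8 12 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genfile(dist1) replace compute_invcovarmat}{p_end}
{p 8 12 2}
{cmd:. use dist1, clear}{p_end}
{p 8 12 2}
{cmd:. count if _score < 0}{p_end}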
{p 4 4 2}
Please see {help mahascore} for more information on the computation of
the Mahalanobis measure.
{title:Examples}
{p 4 8 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) varprefix(d1_)}
{cmd:treated(assisted) compute_invcov}
{p 4 8 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genmat(m1)}
{cmd:treated(assisted) compute_invcov}
{p 4 8 2}
{cmd:. mahascores income age numkids edlevel, idvar(persno) genfile(dist1)}
{cmd:compute_invcov scorevar(d1)}
{title:Acknowledgement}
{p 4 4 2}
The author wishes to thank
Heiko Giebler of Wissenschaftszentrum Berlin fur Sozialforschung
GmbH for a suggestion leading to the development of this program.
{title:Author}
{p 4 4 2}
David Kantor.
Email {browse "mailto:kantor.d@att.net":kantor.d@att.net} if you observe any
problems.
{title:Also See}
{p 4 4 2}
{help mahascore}, {help mahascore2}, {help mahapick}, {help covariancemat}, {help variancemat},
{help screenmatches}, {help stackids}.