------------------------------------------------------------------------------- help for mahapick -------------------------------------------------------------------------------

Select matching observations based on a Mahalanobis scoring

mahapick varlist [weight] , idvar(idvarname) treated(treatedvar) [ pickids(pickidvars) genfile(filename) replace prime_id(prime_id_var) matchnum(matchnum_var) nummatches(#) full matchon(matchonvars) sliceby(slicebyvars) clear fast score scorevar(scorevarname) all unsquared euclidean display(display_options) float nocovtrlimitation ]


mahapick seeks matching observations for a set of "treated" observations, using a Mahalanobis distance measure which it calculates.

The "treated" observations are the ones for which you are seeking matches; the others, the non-treated, form the pool of potential matches (or "control" observations). (The use of the term "treated" comes from the study of medical treatments.) Both the treated and non-treated observations are expected to be present together in one dataset, currently in memory. The treated observations are identified by treatedvar.

For each treated observation, the closest matching non-treated observation(s) will be chosen, according to the calculated distance measure, and subject to the constraints of matchon(matchonvars) if that option is used. The selection of matches is done independently for each treated observation; a given control observation may appear as a match for more than one treated observation. (But, of course, matched control observations are unique within the set selected for any particular treated observation, if multiple matches are chosen.)

-------------------------------------------------------------------- technical note: Choosing unique matches is beyond the scope of what mahapick was designed for, and involves a multitude of complex issues. However, users can take the output of mahapick and perform further processing to arrive at a uniquely-chosen set. See the score and all options for more remarks about this topic. See also mahascores.

Users desiring a unique selection based on a randomization process should see mahaselectunique. --------------------------------------------------------------------

varlist (the "covariates") is a set of numeric variables on which to build the distance measure - the Mahalanobis score. For each pair of observations, the distance measure (or score) is the matrix product d'Xd, where d is a vector of differences in the set of variables, and X is the inverse of the covariance matrix of varlist. If i and j are indices of two observations, then d = (v1[i]-v1[j] \ v2[i]-v2[j] \ ... \ vn[i]-vn[j]), where v1 v2 ... vn are the variables of varlist.

Thus, the score is the sum of all the possible products of pairs of elements of d, weighted by corresponding elements of X. See mahascore for a further explanation of this. Note that the result is the square of what is properly the Mahalanobis distance, but this distinction should have no effect on the selection of closest matches (lowest scores). The unsquared option will cause the scores to be the proper unsquared values.
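As a minimal sketch of the arithmetic described above, the score for a single pair of observations can be reproduced with Stata's matrix commands. The covariance matrix C and the difference vector d below are made-up illustrative values, not output from mahapick; mahapick itself obtains these quantities through covariancemat and mahascore.

```stata
* Illustrative values only: a 2x2 covariance matrix and a difference vector
matrix C = (4, 1 \ 1, 1)
matrix d = (2 \ 1)

* X is the inverse of the covariance matrix
matrix X = invsym(C)

* The default (squared) score d'Xd; for these values it works out to 4/3
matrix s = d' * X * d
display s[1,1]

* The square root, which is what the unsquared option would report
display sqrt(s[1,1])
```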

The covariances are computed on the treated observations only, also limited to the set of observations that have all elements of varlist non-missing. I.e., the computation of covariances uses case-wise deletion when encountering missing values; the resulting values are potentially different from pair-wise covariances. This may seem like a limitation but it is appropriate; any treated observation with a missing value in one or more elements of varlist will get no matches, so it might as well be excluded at the outset. See nocovtrlimitation for how to override the limitation to treated observations.
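One way to inspect a comparable covariance matrix outside of mahapick is Stata's correlate command with the covariance option, which also drops any observation that has a missing value in any listed variable. This is only a sketch for checking purposes, not a statement of how mahapick computes the matrix internally; the variable names are borrowed from the examples later in this help file.

```stata
* Covariances on treated observations only, with case-wise deletion
correlate income age numkids if assisted == 1, covariance

* For comparison: all observations, roughly analogous to nocovtrlimitation
correlate income age numkids, covariance
```

The display(covar) option shows the matrix that mahapick actually uses.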

Weights are allowed, but affect only the computation of the covariances.

The variables of varlist should be of numeric significance - not categorical. Any categorical variables should be replaced by a set of indicator variables.
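A minimal sketch of building indicator variables, assuming a hypothetical categorical variable named region that you want to use as a covariate rather than in matchon():

```stata
* Create one 0/1 indicator per category: region_1, region_2, ...
tabulate region, generate(region_)

* A full set of indicators always sums to 1, which makes the covariance
* matrix singular; dropping one category avoids that
drop region_1
```

The remaining region_* variables can then be listed in varlist. Where exact agreement on the category is wanted, matchon() may be the more natural choice.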

Required Options

idvar(idvarname) specifies an identifying variable. It can be of any type, but it must be a single variable. Thus, if the existing identifying scheme consists of multiple variables, you should find a way to combine them uniquely into a single variable.

It is the user's responsibility to assure that idvarname uniquely identifies all observations, thus assuring a usable result.
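For instance, a composite identifier can be collapsed into one variable and checked for uniqueness as follows. The names hhid and personid are hypothetical; id0 is the name used in the examples later in this help file.

```stata
* Combine a two-part identifier into a single numeric id
egen long id0 = group(hhid personid)

* Stop with an error if id0 is missing anywhere or is not unique
isid id0
```

Note that group() leaves id0 missing wherever either component is missing, and isid will flag that.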

treated(treatedvar) specifies a numeric variable that distinguishes the treated observations. Its values must be 0 or 1, where 1 indicates a treated observation.

Semi-required Options

pickids(pickidvars) and genfile(filename) are two ways of preserving the results of the matching. You must use one or the other, or both.

pickids(pickidvars) specifies a set of one or more pre-existing variables to hold the id's of the matched observations. It/they must be of the same type as idvarname, and must be filled with missing values ("" for strings) unless the clear option is specified.

If pickidvars consists of more than one variable, then the first will get the best match, the second will get the second best match, and so on.
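A sketch of preparing three pickid variables before the call, assuming idvarname is a numeric variable of type long named id0 (check the actual type with describe):

```stata
* One pre-existing, all-missing variable per desired match
generate long id1 = .
generate long id2 = .
generate long id3 = .

* If id0 were a string variable, say str10, the equivalent would be
* generate str10 id1 = ""

mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted)
```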

genfile(filename) specifies a file into which to post the results.

Note that pickids(pickidvars) puts the results into wide form within the current dataset, whereas genfile(filename) puts them into long form in a separate dataset. (See reshape for a discussion of wide- versus long-shaped data.)

Another difference between these methods is that with pickids(pickidvars), it is up to the user to subsequently save the dataset (or use it directly after its creation), whereas genfile(filename) writes the results to a separate file.

If you create a (wide) dataset using pickids(pickidvars), you can subsequently convert it to long form using stackids.

-------------------------------------------------------------------- Technical note: pickids was the original method provided; genfile was a later addition, and is probably more useful. --------------------------------------------------------------------

If genfile(filename) is used, the resulting file is a Stata dataset with these variables:

A "prime_id" variable of the same type as idvarname. This holds the id of the treated observation for which matches are being found. The default name for this is _prime_id; it can be changed using the prime_id option.

idvarname - the same name and type as in idvar(idvarname). This holds the ids of all observations - treated or matching control observations.

A "matchnum" variable - an int to count up the series of matches for each treated observation. The default name is _matchnum; it can be changed using the matchnum option. This variable will range from 0 to #.

Optionally, a "score" variable, if the score option is specified. This holds the score - the distance measure between the treated (prime_id) observation and the given control observation. See the score option for more about this.

Within this file, there will be, for each treated observation...

one observation representing the treated observation itself, with _matchnum=0, idvarname=_prime_id (or prime_id_var), and _score (or scorevarname) =0 (if score was specified); this is followed by...

zero or more observations for the matches, with _matchnum=1, 2, ... , #, and idvarname holding the id of the matched observations. The first will get the best match, the second will get the second best match, and so on.

For each treated observation, _prime_id (or prime_id_var) is a constant, equaling the id of the treated observation. Note that idvarname = _prime_id for the observations where _matchnum=0.

The notion of "best match" and "second best match", etc., is ambiguous when ties occur in the scoring. In this case, the present sort order determines the choices. See "identical scorings" under Remarks for more on this matter.

Optional Options

matchon(matchonvars) imposes a restriction on the matching process, such that matches will be made only to observations that completely agree with the treated observation on the values in matchonvars. In other words, the dataset is logically partitioned into subsets, as determined by the values in matchonvars, and matching will occur only within each partition. (matchonvars may not include treatedvar.)

It is best that the variables in matchonvars take a fairly small set of values; generally, only categorical variables are appropriate. The types may be numeric or string.

Do not confuse matchonvars with varlist. varlist is a set of variables on which you want the matches to be "close"; matchonvars are variables on which you require perfect agreement.

Missing values (including the extended missing values .a .b, etc.) in matchonvars are regarded as distinct.

sliceby(slicebyvars) imposes the same kind of restriction as does matchon(), restricting the matching to stay within the subsets as determined by the values in slicebyvars. However, sliceby() achieves the effect by different means, dividing the dataset into subsets, running the matching process separately on each subset, and reuniting them afterwards. By contrast, matchon() (without sliceby()) merely limits the matches that are chosen.

sliceby(slicebyvars) may only be specified if matchon(matchonvars) is also specified, and slicebyvars must be a subset of matchonvars. Thus, matchonvars gives the full set of variables on which the matches must completely agree; slicebyvars specifies which of those variables will be the basis for actual slicing of the dataset to achieve the effect. Of course, slicebyvars may equal matchonvars, but there may be some advantage to not doing that, as will be explained shortly.

sliceby() can result in very significant speed improvements for large datasets. But, of course, it is appropriate only where such a partitioning is an existing requirement of the desired matching operation.

sliceby() achieves its speed advantage by reducing unnecessary sorting - at the expense of manipulating many intermediary files. If the slices are exceedingly fine, the work involved in slicing may overshadow the advantages gained. Thus, it may be better for the slices to be coarser than the matchon sets; i.e., use sliceby() to go part-way in dividing up the data, and use matchon() (with one or more additional variables) to complete the effect.

Because slicebyvars is a subset of matchonvars, all remarks regarding matchonvars apply to slicebyvars. In particular, they ought to be categorical, all types are allowed, and extended missing values are regarded as distinct.

-------------------------------------------------------------------- Technical note: it was not functionally necessary to require slicebyvars to be a subset of matchonvars. But it makes for clearer syntax in that it reminds the user that slicing implicitly restricts the matching. That is, regardless of that requirement, the use of sliceby(slicebyvars) implies the same effect as having slicebyvars among the matchonvars.

Also note that the requirement that slicebyvars be a subset of matchonvars imposes the opposite relation between the corresponding subsets of the data; the data subsets corresponding to matchonvars are subsets of those corresponding to slicebyvars. --------------------------------------------------------------------

Note that the covariance matrix and its inverse are precalculated on the whole set (of treated observations only), not on each slice or matchon set. Thus, the use of matchon(matchonvars), with or without sliceby(slicebyvars), is not the same as if you were to run mahapick on each matchon set separately.

fast applies only if sliceby is specified. It causes mahapick to bypass the preserve and restore commands that surround the slicing operation, and thereby can save some time - at the expense of safety. Without fast, if you press the Break key during the processing of the slices, the original dataset will be restored (though any matches made during the processing and recorded using pickids() will be lost). With fast, if you press the Break key during the processing of the slices, you will be left with only the present slice.

nocovtrlimitation specifies that the covariance computation not be limited to treated observations.

unsquared modifies the score values to be the unsquared values, that is, the square roots of the default values. As mentioned elsewhere, the choice of squared or unsquared values ought to have no effect on the selection of matches. Thus, this should only affect the genfile option.

euclidean specifies that the normalized Euclidean measure is to be used, rather than the true Mahalanobis measure - meaning that the off-diagonal elements of the covariance matrix are replaced with zeroes prior to inverting. The result is a measure that accounts for the scale of measurement in each variable of varlist, but ignores correlation between the variables. This is probably not desirable, given the advantages of the true Mahalanobis measure, but is provided as an alternative and for comparison to (or emulation of) earlier releases of mahascore and mahapick. See notes under Change History as well as mahascore for more details on this matter.
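The difference between the two measures can be sketched with Stata's matrix commands. The matrix C and vector d are made-up illustrative values; this shows the arithmetic only and is not mahapick's own code.

```stata
matrix C = (4, 1 \ 1, 1)
matrix d = (2 \ 1)

* True Mahalanobis: invert the full covariance matrix (score is 4/3 here)
matrix X = invsym(C)
matrix s_maha = d' * X * d

* Normalized Euclidean: zero the off-diagonal elements before inverting
* (score is 2^2/4 + 1^2/1 = 2 here)
matrix D = diag(vecdiag(C))
matrix Xe = invsym(D)
matrix s_eucl = d' * Xe * d

display s_maha[1,1]
display s_eucl[1,1]
```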

display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar, then the covariance matrix is listed; if it contains invcov, then the inverse covariance matrix is listed. Any other content is ignored.

Options for use with pickids only

clear indicates that if pickidvars are not all missing, then it is okay to go ahead and replace them with missing values at the start of the process.

Options for use with genfile only

replace indicates that if filename already exists, then it is okay to replace it.

prime_id(prime_id_var) allows you to specify the name for the prime_id variable. The default name is _prime_id.

matchnum(matchnum_var) allows you to specify the name for the matchnum variable. The default name is _matchnum.

nummatches(#) specifies how many matches to collect for each treated observation. The default is 1. Note that this corresponds to the number of pickidvars in the pickids option.

full specifies that if matches cannot be made, then observations with missing values in idvarname are to be written so that there will always be # +1 observations (i.e., # "matches") for each treated observation. Suppose that you specify nummatches(3), and that for a given treated observation, only one match can be found. Then by default, only two observations will be written: one for the treated observation, and one for the match. If full is specified, then two additional observations (with missing values in idvarname) will be written.

score specifies that the file will contain an additional variable, holding the computed distance measure between the treated observation and the control observation. The default name is _score, and its type is double.

Note that to record all the distance measures between all treated observations and all other observations "in place" (using pickids()) would require adding as many new variables as there are control observations, which may or may not be practical. Such a structure would be in wide form; the score option captures that information, but puts it in long form, which may be more practical. See also the remarks about mahascores, below.

One possible use for this option is to allow users to supplement the results with an algorithm for further refinement of the matchings, for example, to reduce a set of candidate matches to a smaller set of unique matches, while minimizing the sum of all distance measures in the selected observations.

-------------------------------------------------------------------- technical note: Implementing such an algorithm may be difficult in Stata; it may be necessary to export the results for use by a program written in a general-purpose programming language. On the other hand, it may be feasible to do it in Mata. --------------------------------------------------------------------

scorevar(scorevarname) allows you to specify the name of the score variable, if the score option is used. The default name is _score.

all signifies that all possible control observations will be included. all without full renders nummatches(#) irrelevant, and is equivalent to specifying nummatches(#), where # is at least as large as the maximal number of available control observations (within matchon groups, if specified).

all with full causes # to be the minimum number of control observation records written for each treated observation (possibly with some filled with missing values to fill out the quota), but there will be more control observations written if they are available.

Note that with all, mahapick performs not so much a selection as a scoring and ranking. Also, the number of control matches written per treated observation can vary from one matchon group to another, if matchon was specified.

The intent of the all option is that it would be used with score, by users who want to take the scores (of all potential pairings) and do their own selection algorithm. But if the user desires the score values, without the sorting or selecting of control observations, then it is recommended to use mahascores instead of mahapick. That provides a way to simply capture the score values for all pairs of observations (or possibly all treated-to-non-treated pairs), and should prove to be faster than mahapick.
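A sketch of that usage, combining only options from the syntax above. The file name allscores is hypothetical, and the example assumes the file is saved with the usual .dta extension.

```stata
* Write every eligible treated-control pairing, with its score, in long form
mahapick income age numkids, idvar(id0) treated(assisted) genfile(allscores) all score replace

* Inspect the result; this replaces the data in memory
use allscores, clear

* _matchnum == 0 marks the treated observations themselves (score 0)
summarize _score if _matchnum > 0
```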

float specifies that the type for the score variable generated by genfile() will be float, rather than double.


Remarks

If any of these conditions occur, then the score will be missing, and no matches will be made for the given treated observation:

Any covariate (variable in varlist) is missing in the treated observation.

Any of the variances are missing or zero (this would affect the whole set). (The omitmiszer option, which formerly offered a way to avoid this, was eliminated in the 1Apr2008 release; see Change History.)

In addition, if any covariate is missing in a control observation, then that observation is excluded from consideration.

It may happen that no matchable control observations are found for a given treated observation, and no match will be assigned. More generally, there may be fewer than # (or fewer than the number of variables in pickidvars) matchable control observations. For example, if you have nummatches(3) (or three pickidvars), and only two eligible matches are found for a given treated observation, then, only two matches will be recorded in filename (or only the first two of the pickidvars will be assigned) for that observation.

These situations are unlikely to occur if the pool of control observations is large - interpreted within each matchon group if matchon() is specified.

There may be cases where identical scorings occur for several potential matches. In this case, the existing sort order is used for breaking ties, taking the earlier-placed observations first (using a stable sort). Consequently, repeated runs will yield identical results, even if ties exist, provided that the initial sort order is kept the same.
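A simple way to keep the initial sort order the same from run to run is to sort on the (unique) id variable immediately before matching, for example:

```stata
* Fix the sort order so that ties are broken the same way on every run
sort id0
mahapick income age numkids, idvar(id0) genfile(myfile) nummatches(8) treated(assisted) replace
```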

Identical scorings are less likely to occur if there are many variables in varlist, or if these variables take on many different values. When identical scorings occur, they usually are the result of identical values in varlist - including cases where varlist is the same for the treated and control observations (for a score of 0).

Note that, while the processing involves sorting, the dataset is returned to its original sort order unless sliceby(slicebyvars) is specified, in which case, the order is that of a stable sort on slicebyvars.

mahapick is rather noisy in its displayed output.

This calls mahascore and covariancemat, other programs by the same author.

It is up to the user to make use of the matches. Generally, you will want to merge some "content" data onto the resulting set for analysis. If you use genfile() (or pickids(), followed by stackids()) your resulting set will be a "basis" in long form, with treated and matched observations together in the same dataset. You will subsequently want to merge content data on to this, presumably using idvarname as the matching variable. This is probably the most desirable form of the resulting data for analysis purposes.

Presumably, treatedvar is an important variable in the analysis, but it may not be present in the basis set, if constructed as described above. You can recover it by including it in the merge, or you can reconstruct it by identifying observations where _matchnum==0.
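A sketch of this merge step, assuming results were written with genfile(myfile) and that the content data live in a hypothetical dataset named content with one observation per id0. The new variable name treated is likewise an assumption.

```stata
use myfile, clear

* If full was used, placeholder records have a missing id; drop them first
drop if missing(id0)

* A control may be matched to several treated observations, hence m:1
merge m:1 id0 using content, keep(master match) nogenerate

* Reconstruct the treatment indicator from the match number
generate byte treated = (_matchnum == 0)
```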

If you have used pickids() and are leaving the data in wide form, you would need to merge on the content data, once for the treated observation, and once for each pickid, with distinct variable names for the content data in each of these merges. Such a data structure may be cumbersome, but it has the one advantage of directly embodying the connection between treated and matched observations - in case that is important to your planned analysis. (For example, you can construct differences between the treated and matched observations.)

-------------------------------------------------------------------- technical note: If you have used pickids() (and not genfile()), but find that you prefer the results in long form, you can either rerun the match process using genfile(), or convert the results to long form using stackids. The latter option may be convenient if the matching process takes a lot of time. (stackids is similar to reshape and stack, but includes provisions to preserve the correspondence between the treated and the matched observations.) --------------------------------------------------------------------

One useful way of using mahapick is to take several more matches per treated observation than you actually expect to use. That is, you specify a large nummatches() value (or a large set of pickidvars). For example, if you want three matches per treated observation, you might collect, say, eight matches per treated observation (specifying nummatches(8)). Then in subsequent analyses, using some code to pre-screen your data, you take the first (best) three "good" matches - good in the sense that they have no missing values in variables needed in the analysis. (Those would be variables in the "content" data mentioned above, which are typically not among those used in the matching (i.e., varlist).) The advantage of this is that, rather than filtering for observations with non-missing values in the content data before the match, you do it at the time you analyze the data. In subsequent analyses, you might adjust the set of variables involved, thereby potentially shifting the set of control observations to exclude. But, given this setup, you will not need to rerun the match. You can also have several analyses with different mixes of variables, each of which takes its own best set of matches. The program screenmatches does this screening for you (with the data in long form).
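The screening idea can be sketched by hand as follows. This is an illustration only and not the implementation of screenmatches; the dataset content and the analysis variables outcome and wage are hypothetical.

```stata
use myfile, clear
drop if missing(id0)
merge m:1 id0 using content, keep(master match) nogenerate

* Keep each treated record, plus controls with complete analysis variables
keep if _matchnum == 0 | !missing(outcome, wage)

* Re-rank the surviving matches within each treated observation
bysort _prime_id (_matchnum): generate int newrank = _n - 1

* Retain the treated record (rank 0) and its three best usable matches
keep if newrank <= 3
```

Whether treated observations with missing analysis variables should also be dropped is left to the analyst.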

If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures. (It may also cause the unsquared option to have a real effect on the choice of matches.)

As it stands presently, there are no [if exp] or [in range] features provided. They were not deemed essential when mahapick was first created, but could be added if there is a demand for them.
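In the meantime, a possible workaround is to restrict the data in memory before calling mahapick and to reload the full data afterwards. The sketch below uses a temporary file rather than preserve, because mahapick may itself call preserve (as it does around slicing) and Stata does not allow nested preserve. The condition year == 2005 is hypothetical.

```stata
* Save the full dataset, then keep only the observations of interest
tempfile original
save `original'
keep if year == 2005

* Results go to a separate file, so they survive reloading the original data
mahapick income age numkids, idvar(id0) genfile(myfile) nummatches(8) treated(assisted) replace

use `original', clear
```

Note that the covariances are then computed from the retained subset only, which may or may not be what you want.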

See mahaselectunique for a further discussion of issues relating to the formulation of the covariate set and the quality of the scoring, as well as how that relates to unique selection.


Examples

. mahapick income age numkids, idvar(id0) genfile(myfile) nummatches(8) full treated(assisted)
. mahapick income age numkids, idvar(id0) genfile(myfile) nummatches(8) full treated(assisted) matchon(sex region) sliceby(region)
. mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted)
. mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted) matchon(sex region)
. mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted) matchon(sex region) sliceby(region)

Change History

The 1Apr2008 release implements the full Mahalanobis measure. Prior to that release, the normalized Euclidean measure was used, which is equivalent to the current version under the euclidean option. Referring to the d vector mentioned under the description of varlist, the normalized Euclidean measure is the sum of the squares of the components of d, weighted by the inverse variance of each variable.

The 1Apr2008 release eliminated the common and omitmiszer options, which were deemed as inappropriate for the changes to the program. Note that common was to limit variance computations to the set of common observations that have no missing values in varlist; the present method (for covariances) always imposes that limitation.

The 1Apr2008 release added these options: unsquared, euclidean, float, display(), and nocovtrlimitation.

Acknowledgement

The author wishes to thank Joseph Harkness, formerly of The Institute for Policy Studies at Johns Hopkins University, for guidance in developing this program, as well as Heiko Giebler of Wissenschaftszentrum Berlin für Sozialforschung GmbH, for suggesting further improvements.

Author

David Kantor; initial development was done at The Institute for Policy Studies, Johns Hopkins University. Email kantor.d@att.net if you observe any problems.

Also See

mahascore, mahascores, mahascore2, covariancemat, variancemat,