-------------------------------------------------------------------------------
help for mahapick
-------------------------------------------------------------------------------

Select matching observations based on a Mahalanobis scoring

mahapick varlist [weight], idvar(idvarname) treated(treatedvar) [pickids(pickidvars) genfile(filename) replace prime_id(prime_id_var) matchnum(matchnum_var) nummatches(#) full matchon(matchonvars) sliceby(slicebyvars) clear fast score scorevar(scorevarname) all unsquared euclidean display(display_options) float nocovtrlimitation]

Description

mahapick seeks matching observations for a set of "treated" observations, using a Mahalanobis distance measure which it calculates.

The "treated" observations are the ones for which you are seeking matches; the others, the non-treated, form the pool of potential matches (or "control" observations). (The use of the term "treated" comes from the study of medical treatments.) Both the treated and non-treated observations are expected to be present together in one dataset, currently in memory. The treated observations are identified by treatedvar.

For each treated observation, the closest matching non-treated observation(s) will be chosen, according to the calculated distance measure, and subject to the constraints of matchon(matchonvars) if that option is used. The selection of matches is done independently for each treated observation; a given control observation may appear as a match for more than one treated observation. (But, of course, matched control observations are unique within the set selected for any particular treated observation, if multiple matches are chosen.)

--------------------------------------------------------------------
technical note: Choosing unique matches is beyond the scope of what mahapick was designed for, and involves a multitude of complex issues. However, users can take the output of mahapick and perform further processing to arrive at a uniquely-chosen set. See the score and all options for more remarks about this topic. See also mahascores.

Users desiring a unique selection based on a randomization process should see mahaselectunique.
--------------------------------------------------------------------

varlist (the "covariates") is a set of numeric variables on which to build the distance measure - the Mahalanobis score. For each pair of observations, the distance measure (or score) is the matrix product d'Xd, where d is a vector of differences in the set of variables, and X is the inverse of the covariance matrix of varlist. If i and j are indices of two observations, then d = (v1[i]-v1[j] \ v2[i]-v2[j] \ ... \ vn[i]-vn[j]), where v1 v2 ... vn are the variables of varlist.

Thus, the score is the sum of all the possible products of pairs of elements of d, weighted by corresponding elements of X. See mahascore for a further explanation of this. Note that the result is the square of what is properly the Mahalanobis distance, but this distinction should have no effect on the selection of closest matches (lowest scores). The unsquared option will cause the scores to be the proper unsquared values.

The covariances are computed on the treated observations only, and only on observations in which every variable of varlist is non-missing. That is, the computation of covariances uses case-wise deletion when encountering missing values; the resulting values are potentially different from pair-wise covariances. This may seem like a limitation but it is appropriate; any treated observation with a missing value in one or more elements of varlist will get no matches, so it might as well be excluded at the outset. See nocovtrlimitation for how to override the limitation to treated observations.

Weights are allowed, but affect only the computation of the covariances.
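The score formula d'Xd can be reproduced by hand for a single pair of observations. The following is only an illustrative sketch: it uses Stata's shipped auto dataset, treats foreign as a stand-in for treatedvar, and does not claim to mirror how mahapick is implemented internally. The variable choices and the pair of observations compared are assumptions made for the illustration.

```stata
* Illustrative sketch only; not mahapick's own code.
sysuse auto, clear
* Put controls (foreign==0) first and "treated" (foreign==1) last
sort foreign, stable

* Covariance matrix of the covariates, treated observations only
quietly correlate price mpg weight if foreign == 1, covariance
matrix C = r(C)
* X is the inverse of the covariance matrix
matrix X = invsym(C)

* d = differences between the last (treated) and the first (control) observation
local d1 = price[_N]  - price[1]
local d2 = mpg[_N]    - mpg[1]
local d3 = weight[_N] - weight[1]
matrix d = (`d1' \ `d2' \ `d3')

* S is a 1 x 1 matrix holding the (squared) score
matrix S = d' * X * d
display "score (squared): " S[1,1]
display "unsquared value: " sqrt(S[1,1])
```

The second displayed value corresponds to what the unsquared option is described as reporting.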

The variables of varlist should be of numeric significance - not categorical. Any categorical variables should be replaced by a set of indicator variables.

Required Options

idvar(idvarname) specifies an identifying variable. It can be of any type, but it must be a single variable. Thus, if the existing identifying scheme consists of multiple variables, you should find a way to combine them uniquely into a single variable.

It is the user's responsibility to ensure that idvarname uniquely identifies all observations, thus ensuring a usable result.
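One possible way to build a single identifier from several variables, and to check its uniqueness before matching, is sketched below. The variable names state, county, and person are hypothetical, and id0 is simply the name used in the Examples section.

```stata
* Hypothetical identifying variables: state, county, person
* group() assigns one integer code to each distinct combination of values
egen long id0 = group(state county person)

* Stops with an error unless id0 uniquely identifies the observations
isid id0
```

Note that egen's group() function leaves the result missing wherever any of the source variables is missing (unless its missing option is given), so such observations would need separate attention.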

treated(treatedvar) specifies a numeric variable that distinguishes the treated observations. Its values must be 0 or 1, where 1 indicates a treated observation.

Semi-required Options

pickids(pickidvars) and genfile(filename) are two ways of preserving the results of the matching. You must use one or the other, or both.

pickids(pickidvars) specifies a set of one or more pre-existing variables to hold the ids of the matched observations. They must be of the same type as idvarname, and must be filled with missing values ("" for strings) unless the clear option is specified.

If pickidvars consists of more than one variable, then the first will get the best match, the second will get the second best match, and so on.
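Because the pickidvars must exist beforehand, they have to be created before mahapick is called. A minimal sketch, assuming that id0 is stored as a long (an assumption; use whatever type your idvar actually has):

```stata
* Assumes idvar(id0) is a numeric variable of type long
generate long id1 = .
generate long id2 = .
generate long id3 = .

mahapick income age numkids, idvar(id0) treated(assisted) pickids(id1 id2 id3)

* For a string idvar, the pick variables would be created as strings, e.g.:
* generate str10 id1 = ""
```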

genfile(filename) specifies a file into which to post the results.

Note that pickids(pickidvars) puts the results into wide form within the current dataset, whereas genfile(filename) puts them into long form in a separate dataset. (See reshape for a discussion of wide- versus long-shaped data.)

Another difference between these methods is that with pickids(pickidvars), it is up to the user to subsequently save the dataset (or use it directly after its creation), whereas genfile(filename) writes the results to a separate file.

If you create a (wide) dataset using pickids(pickidvars), you can subsequently convert it to long form using stackids.

--------------------------------------------------------------------

Technical note: pickids was the original method provided; genfile was a later addition, and is probably more useful.
--------------------------------------------------------------------

If genfile(filename) is used, the resulting file is a Stata dataset with these variables:

A "prime_id" variable of the same type as idvarname. This holds the id of the treated observation for which matches are being found. The default name for this is _prime_id; it can be changed using the prime_id option.

idvarname - the same name and type as in idvar(idvarname). This holds the ids of all observations - treated or matching control observations.

A "matchnum" variable - an int to count up the series of matches for each treated observation. The default name is _matchnum; it can be changed using the matchnum option. This variable will range from 0 to #.

Optionally, a "score" variable, if the score option is specified. This holds the score - the distance measure between the treated (prime_id) observation and the given control observation. See the score option for more about this.

Within this file, there will be, for each treated observation...

one observation representing the treated observation itself, with _matchnum=0, idvarname=_prime_id (or prime_id_var), and _score (or scorevarname) =0 (if score was specified); this is followed by...

zero or more observations for the matches, with _matchnum=1, 2, ... , #, and idvarname holding the id of the matched observations. The first will get the best match, the second will get the second best match, and so on.

For each treated observation, _prime_id (or prime_id_var) is a constant, equaling the id of the treated observation. Note that idvarname = _prime_id for the observations where _matchnum=0.

The notion of "best match" and "second best match", etc., is ambiguous when ties occur in the scoring. In this case, the present sort order determines the choices. See "identical scorings" under Remarks for more on this matter.

Optional Options

matchon(matchonvars) imposes a restriction on the matching process, such that matches will be made only to observations that completely agree with the treated observation on the values in matchonvars. In other words, the dataset is logically partitioned into subsets, as determined by the values in matchonvars, and matching will occur only within each partition. (matchonvars may not include treatedvar.)

It is best that the variables in matchonvars take a fairly small set of values; generally, only categorical variables are appropriate. The types may be numeric or string.

Do not confuse matchonvars with varlist. varlist is a set of variables on which you want the matches to be "close"; matchonvars are variables on which you require perfect agreement.

Missing values (including the extended missing values .a, .b, etc.) in matchonvars are regarded as distinct.

sliceby(slicebyvars) imposes the same kind of restriction as does matchon(), restricting the matching to stay within the subsets as determined by the values in slicebyvars. However, sliceby() achieves the effect by different means, dividing the dataset into subsets, running the matching process separately on each subset, and reuniting them afterwards. By contrast, matchon() (without sliceby()) merely limits the matches that are chosen.

sliceby(slicebyvars) may only be specified if matchon(matchonvars) is also specified, and slicebyvars must be a subset of matchonvars. Thus, matchonvars gives the full set of variables on which the matches must completely agree; slicebyvars specifies which of those variables will be the basis for actual slicing of the dataset to achieve the effect. Of course, slicebyvars may equal matchonvars, but there may be some advantage to not doing that, as will be explained shortly.

sliceby() can result in very significant speed improvements for large datasets. But, of course, it is appropriate only where such a partitioning is an existing requirement of the desired matching operation.

sliceby() achieves its speed advantage by reducing unnecessary sorting - at the expense of manipulating many intermediary files. If the slices are exceedingly fine, the work involved in slicing may overshadow the advantages gained. Thus, it may be better for the slices to be coarser than the matchon sets; i.e., use sliceby() to go part-way in dividing up the data, and use matchon() (with one or more additional variables) to complete the effect.

Because slicebyvars is a subset of matchonvars, all remarks regarding matchonvars apply to slicebyvars. In particular, they ought to be categorical, all types are allowed, and extended missing values are regarded as distinct.

--------------------------------------------------------------------

Technical note: it was not functionally necessary to require slicebyvars to be a subset of matchonvars. But it makes for clearer syntax in that it reminds the user that slicing implicitly restricts the matching. That is, regardless of that requirement, the use of sliceby(slicebyvars) implies the same effect as having slicebyvars among the matchonvars.

Also note that the requirement that slicebyvars be a subset of matchonvars imposes the opposite relation between the corresponding subsets of the data; the data subsets corresponding to matchonvars are subsets of those corresponding to slicebyvars.
--------------------------------------------------------------------

Note that the covariance matrix and its inverse are precalculated on the whole set (of treated observations only), not on each slice or matchon set. Thus, the use of matchon(matchonvars), with or without sliceby(slicebyvars), is not the same as if you were to run mahapick on each matchon set separately.

fast applies only if sliceby is specified. It causes mahapick to bypass the preserve and restore commands that surround the slicing operation, and thereby can save some time - at the expense of safety. Without fast, if you press the Break key during the processing of the slices, the original dataset will be restored (though any matches made during the processing and recorded using pickids() will be lost). With fast, if you press the Break key during the processing of the slices, you will be left with only the present slice.

nocovtrlimitation specifies that the covariance computation not be limited to treated observations.

unsquared modifies the score values to be the unsquared values, that is, the square roots of the default values. As mentioned elsewhere, the choice of squared or unsquared values ought to have no effect on the selection of matches. Thus, this should only affect the genfile option.

euclidean specifies that the normalized Euclidean measure is to be used, rather than the true Mahalanobis measure - meaning that the off-diagonal elements of the covariance matrix are replaced with zeroes prior to inverting. The result is a measure that accounts for the scale of measurement in each variable of varlist, but ignores correlation between the variables. This is probably not desirable, given the advantages of the true Mahalanobis measure, but is provided as an alternative and for comparison to (or emulation of) earlier releases of mahascore and mahapick. See notes under Change History as well as mahascore for more details on this matter.
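The effect described for euclidean, zeroing the off-diagonal covariances before inverting, can be inspected by hand. This is a sketch for illustration only, using Stata's shipped auto dataset with foreign standing in for treatedvar; it is not taken from mahapick's source.

```stata
* Illustrative sketch only; not mahapick's own code.
sysuse auto, clear
quietly correlate price mpg weight if foreign == 1, covariance
matrix C = r(C)

* Keep only the variances: off-diagonal elements become zero
matrix Cd = diag(vecdiag(C))
* Inverse of the diagonal matrix: each diagonal element is 1/variance
matrix Xe = invsym(Cd)

matrix list C
matrix list Xe
```

With Xe in place of the full inverse covariance matrix, d'Xd reduces to the sum of squared differences, each divided by the variance of its variable.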

display(display_options) turns on the display of certain data structures used in the computation. If display_options contains covar, then the covariance matrix is listed; if it contains invcov, then the inverse covariance matrix is listed. Any other content is ignored.

Options for use with pickids only

clear indicates that if pickidvars are not all missing, then it is okay to go ahead and replace them with missing values at the start of the process.

Options for use with genfile only

replace indicates that if filename already exists, then it is okay to replace it.

prime_id(prime_id_var) allows you to specify the name for the prime_id variable. The default name is _prime_id.

matchnum(matchnum_var) allows you to specify the name for the matchnum variable. The default name is _matchnum.

nummatches(#) specifies how many matches to collect for each treated observation. The default is 1. Note that this corresponds to the number of pickidvars in the pickids option.

full specifies that if matches cannot be made, then observations with missing values in idvarname are to be written so that there will always be #+1 observations (i.e., # "matches") for each treated observation. Suppose that you specify nummatches(3), and that for a given treated observation, only one match can be found. Then by default, only two observations will be written: one for the treated observation, and one for the match. If full is specified, then two additional observations (with missing values in idvarname) will be written.

score specifies that the file will contain an additional variable, holding the computed distance measure between the treated observation and the control observation. The default name is _score, and its type is double.

Note that to record all the distance measures between all treated observations and all other observations "in place" (using pickids()) would require adding as many new variables as there are control observations, which may or may not be practical. Such a structure would be in wide form; the score option captures that information, but puts it in long form, which may be more practical. See also the remarks about mahascores, below.

One possible use for this option is to allow users to supplement the results with an algorithm for further refinement of the matchings, for example, to reduce a set of candidate matches to a smaller set of unique matches, while minimizing the sum of all distance measures in the selected observations.

--------------------------------------------------------------------

technical note: Implementing such an algorithm may be difficult in Stata; it may be necessary to export the results for use by a program written in a general-purpose programming language. On the other hand, it may be feasible to do it in Mata.
--------------------------------------------------------------------

scorevar(scorevarname) allows you to specify the name of the score variable, if the score option is used. The default name is _score.

all signifies that all possible control observations will be included. all without full renders nummatches(#) irrelevant, and is equivalent to specifying nummatches(#), where # is at least as large as the maximal number of available control observations (within matchon groups, if specified).

all with full causes # to be the minimum number of control observation records written for each treated observation (possibly with some filled with missing values to fill out the quota), but there will be more control observations written if they are available.

Note that with all, mahapick is not so much selecting matches as scoring and ranking them. Also, the number of control matches written per treated observation can vary from one matchon group to another, if matchon was specified.

The intent of the all option is that it would be used with score, by users who want to take the scores (of all potential pairings) and apply their own selection algorithm. But if the user desires the score values, without the sorting or selecting of control observations, then it is recommended to use mahascores instead of mahapick. That provides a way to simply capture the score values for all pairs of observations (or possibly all treated-to-non-treated pairs), and should prove to be faster than mahapick.
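As a sketch of how all and score might be combined with a user-defined selection rule: the file name allscores and the cutoff value of 2 are arbitrary assumptions chosen for illustration, and the variable names are those of the Examples section.

```stata
* Record every treated-to-control score in long form (file name is hypothetical)
mahapick income age numkids, idvar(id0) treated(assisted) genfile(allscores) all score replace

* Save your working data first: the next command replaces the dataset in memory
use allscores, clear

* Drop the rows that represent the treated observations themselves
drop if _matchnum == 0

* Example of a user-defined rule: keep only controls scoring below a chosen cutoff
keep if _score < 2
```

Any more elaborate rule (for example, one that enforces unique matches) would replace the final step.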

float specifies that the type for the score variable generated by genfile() will be float, rather than double.

Remarks

If any of these conditions occur, then the score will be missing, and no matches will be made for the given treated observation:

Any covariate (variable in varlist) is missing in the treated observation.

Any of the variances are missing or zero (this would affect the whole set). (You can automatically avoid this by the use of the omitmiszero option.)

In addition, if any covariate is missing in a control observation, then that observation is excluded from consideration.

It may happen that no matchable control observations are found for a given treated observation, and no match will be assigned. More generally, there may be fewer than # (or fewer than the number of variables in pickidvars) matchable control observations. For example, if you have nummatches(3) (or three pickidvars), and only two eligible matches are found for a given treated observation, then only two matches will be recorded in filename (or only the first two of the pickidvars will be assigned) for that observation.

Any of these situations is unlikely to occur if the pool of control observations is large - interpreted within each matchon group if matchon() is specified.

There may be cases where identical scorings occur for several potential matches. In this case, the existing sort order is used for breaking ties, taking the earlier-placed observations first (using a stable sort). Consequently, repeated runs will yield identical results, even if ties exist, provided that the initial sort order is kept the same.

Identical scorings are less likely to occur if there are many variables in varlist, or if these variables take on many different values. When identical scorings occur, they usually are the result of identical values in varlist - including cases where varlist is the same for the treated and control observations (for a score of 0).

Note that, while the processing involves sorting, the dataset is returned to its original sort order unless sliceby(slicebyvars) is specified, in which case, the order is that of a stable sort on slicebyvars.
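Since ties are broken by the existing sort order, one simple way to keep repeated runs reproducible is to put the data into a known order before matching. A minimal sketch, using the variable names of the Examples section and assuming id0 is unique:

```stata
* Sorting on the unique id fixes the order in which tied controls are considered
sort id0
mahapick income age numkids, idvar(id0) treated(assisted) genfile(myfile) nummatches(3) replace
```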

mahapick is rather noisy in its displayed output.

This program calls mahascore and covariancemat, other programs by the same author.

It is up to the user to make use of the matches. Generally, you will want to merge some "content" data onto the resulting set for analysis. If you use genfile() (or pickids(), followed by stackids) your resulting set will be a "basis" in long form, with treated and matched observations together in the same dataset. You will subsequently want to merge content data on to this, presumably using idvarname as the matching variable. This is probably the most desirable form of the resulting data for analysis purposes.

Presumably, treatedvar is an important variable in the analysis, but it may not be present in the basis set, if constructed as described above. You can recover it by including it in the merge, or you can reconstruct it by identifying observations where _matchnum==0.

If you have used pickids() and are leaving the data in wide form, you would need to merge on the content data, once for the treated observation, and once for each pickid, with distinct variable names for the content data in each of these merges. Such a data structure may be cumbersome, but it has the one advantage of directly embodying the connection between treated and matched observations - in case that is important to your planned analysis. (For example, you can construct differences between the treated and matched observations.)

--------------------------------------------------------------------

technical note: If you have used pickids() (and not genfile()), but find that you prefer the results in long form, you can either rerun the match process using genfile(), or convert the results to long form using stackids. The latter option may be convenient if the matching process takes a lot of time. (stackids is similar to reshape and stack, but includes provisions to preserve the correspondence between the treated and the matched observations.)
--------------------------------------------------------------------

One useful way of using mahapick is to take several more matches per treated observation than you actually expect to use. That is, you specify a large nummatches() value (or a large set of pickidvars). For example, if you want three matches per treated observation, you might collect, say, eight matches per treated observation (specifying nummatches(8)). Then in subsequent analyses, using some code to pre-screen your data, you take the first (best) three "good" matches - good in the sense that they have no missing values in variables needed in the analysis. (Those would be variables in the "content" data mentioned above, which are typically not among those used in the matching (i.e., varlist).) The advantage of this is that, rather than filtering for observations with non-missing values in the content data before the match, you do it at the time you analyze the data. In subsequent analyses, you might adjust the set of variables involved, thereby potentially shifting the set of control observations to exclude. But, given this setup, you will not need to rerun the match. You can also have several analyses with different mixes of variables, each of which takes its own best set of matches. The program screenmatches does this screening for you (with the data in long form).

If the inverse covariance matrix is computed on a very small set of observations, it may not be valid and may yield strange results. It might fail to be positive semi-definite, and can yield negative measures. (It may also cause the unsquared option to have a real effect on the choice of matches.)

As it stands presently, there are no [if exp] or [in range] features provided. They were not deemed essential when mahapick was first created, but could be added if there is a demand for them.

See mahaselectunique for a further discussion of issues relating to the formulation of the covariate set and the quality of the scoring, as well as how that relates to unique selection.

Examples

. mahapick income age numkids, idvar(id0) genfile(myfile) nummatches(8) full treated(assisted)

. mahapick income age numkids, idvar(id0) genfile(myfile) nummatches(8) full treated(assisted) matchon(sex region) sliceby(region)

. mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted)

. mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted) matchon(sex region)

. mahapick income age numkids, idvar(id0) pickids(id1 id2 id3) treated(assisted) matchon(sex region) sliceby(region)
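The merge step described under Remarks might look like the following. This is a sketch under assumptions: content is a hypothetical dataset, keyed uniquely by id0, that holds the analysis variables; it is not something mahapick creates.

```stata
* "content" is a hypothetical dataset with one record per id0
mahapick income age numkids, idvar(id0) treated(assisted) genfile(myfile) nummatches(3) replace

* Save your working data first: the next command replaces the dataset in memory
use myfile, clear

* Rebuild the treatment indicator: _matchnum is 0 only for the treated rows
generate byte assisted = (_matchnum == 0)

* A control may be matched to several treated observations, hence many-to-one on id0
merge m:1 id0 using content, keep(master match)
```

The many-to-one merge reflects the fact that the same control id can appear under more than one _prime_id.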

Change History

The 1Apr2008 release implements the full Mahalanobis measure. Prior to that release, the normalized Euclidean measure was used, which is equivalent to the current version under the euclidean option. Referring to the d vector mentioned under the description of varlist, the normalized Euclidean measure is the sum of the squares of the components of d, weighted by the inverse variance of each variable.

The 1Apr2008 release eliminated the common and omitmiszero options, which were deemed inappropriate for the changes to the program. Note that common was to limit variance computations to the set of common observations that have no missing values in varlist; the present method (for covariances) always imposes that limitation.

The 1Apr2008 release added these options: unsquared, euclidean, float, display(), and nocovtrlimitation.

Acknowledgement

The author wishes to thank Joseph Harkness, formerly of The Institute for Policy Studies at Johns Hopkins University, for guidance in developing this program, as well as Heiko Giebler of Wissenschaftszentrum Berlin für Sozialforschung GmbH, for suggesting further improvements.

Author

David Kantor; initial development was done at The Institute for Policy Studies, Johns Hopkins University. Email kantor.d@att.net if you observe any problems.

Also See

mahascore, mahascores, mahascore2, covariancemat, variancemat