------------------------------------------------------------------------------- help for mahaselectunique -------------------------------------------------------------------------------
Given a file written by mahapick with the genfile option, select a set of unique matches for the treated set.
mahaselectunique, usefile(filename1) writefile(filename2) idvar(idvar) [ prime_id(prime_id) matchnum(matchnum) scorevar(scorevar) nmatch(#) seed(seedvalue) replace clear nostrict ]
Description
mahaselectunique takes a file in the form generated by the genfile option of mahapick, and performs a unique selection of matches for the treated set using a randomized priority order. The resulting selections are written to a separate file.
The file to be used, filename1, should be in the form written by mahapick with the genfile() option. That is, it should have these variables: a "prime_id" variable that identifies treated cases; an "id" variable that identifies control cases matched to the treated cases; a "matchnum" variable, assumed to be int, enumerating the matches in closeness order; and, optionally, a "score" variable (float or double) containing the distance measure.
In filename1, every treated case should be present at least once, with matchnum = 0 (and idvar = prime_id, though this latter condition is not mandatory). This is not a proposed match; it just carries the information that the prime_id value is a treated case. (But see the nostrict option for more on this matter.) Additionally, there are zero or more cases for the potential matches, with matchnum = 1, 2, 3, etc., corresponding to the best match, the second best match, the third best match, and so on. These potential matches are unique within each treated case, but not necessarily unique across different treated cases. The purpose of mahaselectunique is to select among the potential matches such that they are unique across all treated cases. Furthermore, if more than one match per treated case is requested (nmatch(#) with # > 1), the selected matches are unique across the entire set.
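Before running mahaselectunique, it can be worth confirming that a file has the structure described above. The following is a minimal sketch, not part of the program; it assumes the file is named match1, the id variable is id0, and the other variables have their default names (_prime_id and _matchnum). Adjust the names to your own data.

```stata
* Sketch only: match1, id0, _prime_id and _matchnum are assumed names
use match1, clear

* matchnum should never be negative or missing in the input file
assert _matchnum >= 0 & !missing(_matchnum)

* number of records that flag a treated case (matchnum = 0)
count if _matchnum == 0

* potential matches should be unique within each treated case
duplicates report _prime_id id0 if _matchnum > 0

* ... but the same control id may recur across different treated cases
duplicates report id0 if _matchnum > 0
```

The last report gives a rough picture of how much overlap mahaselectunique will have to resolve.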
Required Options
usefile(filename1) specifies the file to read, as described above.
writefile(filename2) specifies the file to be written; it will have the same structure as filename1, but without the matchnum = 0 cases. It will have exactly # cases for each value of prime_id, though some may have missing idvar and matchnum values.
idvar(idvar) specifies the name of the id variable, which holds the identifier values of the matched control cases in the two files.
Optional Options
prime_id(prime_id) allows you to specify the name for the prime_id variable, which holds the identifier values of the treated cases in the two files. The default name is _prime_id.
matchnum(matchnum) allows you to specify the name for the matchnum variable, which enumerates the matched control cases, presumably in increasing order of the distance measure value. The default name is _matchnum.
scorevar(scorevar) allows you to specify the name of the score variable, containing the distance measure. The default name is _score. See additional notes under Remarks.
--------------------------------------------------------------------
note: The four variables specified by idvar(idvar), prime_id(prime_id), matchnum(matchnum), and scorevar(scorevar) will be sought in filename1 and written into filename2. They correspond to the same-named options in mahapick, assuming that filename1 was written by mahapick.
scorevar(scorevar) is optionally present in filename1; if it is absent, just omit this option. If you do not specify this option, and there is a variable named _score in filename1, it will automatically be carried into filename2.
The variable names specified by prime_id(prime_id), matchnum(matchnum), and scorevar(scorevar) have the same default values as in mahapick. Thus, you need to specify them only if you specified them in mahapick with non-default values.
--------------------------------------------------------------------
nmatch(#) specifies how many matches per treated case to gather. The default is 1. See additional notes under Remarks.
seed(seedvalue) specifies a seed value for the random-selection process. This can be useful if you want to be able to exactly replicate the process; it can also be included in the documentation of your data-preparation processes. The value provided can be anything that is acceptable to the set seed command. This includes integers in the range of 1..(2^32-1), and "code" values - those returned by the c(seed) function, representing the internal state of the random number generator. The latter appear to be strings of the form "X" (upper case X) followed by 36 hex digits, though possibly not all such strings are accepted.
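As a small illustration of using the seed for replication, the sketch below keeps the seed in a local macro so that the same value is both logged and passed to the program. The macro name myseed, the file names, and the id variable id0 are assumptions for the example.

```stata
* Sketch only: myseed, match1, match2 and id0 are assumed names
local myseed 14716

* record the seed in the log so the selection can be reproduced later
display "seed used for match selection: `myseed'"

mahaselectunique, idvar(id0) usefile(match1) writefile(match2) seed(`myseed')
```

Rerunning with the same seed and the same filename1 should reproduce the same selections.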
replace specifies that if filename2 already exists, it will be overwritten.
clear specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk. This program will use filename1, so the in-memory data need to be either empty (cleared) or "safe" to replace (saved to disk); otherwise, you will need to specify clear.
nostrict relaxes the requirement that every treated case be present in filename1 with one record having matchnum=0. By default, the set of treated cases is taken from the prime_id values in records having matchnum==0. If there are any treated cases with only the matchnum>0 cases present, then such cases will be omitted (but you will be shown a list of their prime_id values). The nostrict option relaxes this behavior; the set of treated cases will then be the set of all prime_id values present in filename1. If filename1 was produced by mahapick, then this option should not be needed.
Remarks
See mahapick for an explanation of the distance measure, the matching process, and the concept of the treated cases. mahapick will obtain a set of potential match candidates for each of the treated cases, but they may not be unique. (They should be unique for any given treated case, but not necessarily unique across different treated cases.) mahaselectunique will attempt to reduce that to a set of unique matches.
The selection process begins by marking all potential matches as available. The treated cases are visited in a random order. For each treated case, the best available matching control case is chosen, and all other potential matches (for all treated cases) with the same value of idvar are marked as unavailable. (The "best available matching control case" is the one with the lowest matchnum value among those that are available.) If nmatch(#) is specified as greater than 1, then additional passes through the treated cases are performed; each pass through this process is given a different random order. (Thus, the second and subsequent selections for a given treated case are not made in the same pass; to do so would give the early-chosen treated cases an unfair advantage on all of their selections.)
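To make the selection process concrete, here is an illustrative sketch of a single pass (the nmatch(1) case). It is not the program's own code, only one way the described steps could be written. It assumes that match1 was written by mahapick, that id0 is a numeric control id, and that _prime_id and _matchnum are the default variable names; the variables u, visit_order, available, and selected are invented for the example.

```stata
* Illustrative sketch of one selection pass; NOT the program's own code.
* Assumed names: match1, id0 (numeric), _prime_id, _matchnum.
use match1, clear
set seed 14716

* 1. Put the treated cases (matchnum = 0 records) in a random visiting order
preserve
keep if _matchnum == 0
keep _prime_id
duplicates drop
generate double u = runiform()
sort u
generate long visit_order = _n
keep _prime_id visit_order
tempfile order
save `order'
restore

* 2. Attach the visiting order to the potential matches; all start as available
drop if _matchnum == 0
merge m:1 _prime_id using `order', keep(match) nogenerate
generate byte available = 1
generate byte selected = 0

* 3. Visit the treated cases in order, taking the best available candidate
quietly summarize visit_order, meanonly
local ntreated = r(max)
forvalues k = 1/`ntreated' {
    quietly summarize _matchnum if visit_order == `k' & available, meanonly
    if r(N) > 0 {
        local best = r(min)
        quietly levelsof id0 if visit_order == `k' & _matchnum == `best', local(chosen)
        quietly replace selected = 1 if visit_order == `k' & _matchnum == `best'
        * the chosen control is no longer available to any treated case
        quietly replace available = 0 if id0 == `chosen'
    }
}

keep if selected
sort _prime_id
list _prime_id id0 _matchnum
```

The sketch assumes at least one potential match exists in the file. For more than one match per treated case, the same loop would be repeated with a fresh random order on each pass, leaving the available flags in place between passes.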
filename2 will not be sorted as you might expect. It is written in the order that selections were made, that is, randomly. But if you sort on prime_id and matchnum, it will make more sense. It may have cases with matchnum=-1, signifying that filename1 had no potential matches for the given value of prime_id; it may also have cases with matchnum=., signifying that no more potential matches were available. (Note that filename1 should have no instances of matchnum being negative or missing.)
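A minimal sketch of putting filename2 into a readable order and counting the two special matchnum codes follows. It assumes the output file is named match2, the id variable is id0, and the default variable names are in use.

```stata
* Sketch only: match2, id0, _prime_id and _matchnum are assumed names
use match2, clear
sort _prime_id _matchnum
list _prime_id id0 _matchnum, sepby(_prime_id)

* treated cases that had no potential matches at all in filename1
count if _matchnum == -1

* slots that could not be filled because the candidates ran out
count if missing(_matchnum)
```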
If scorevar is included, it is not used in determining the closeness of the match; it is only carried along into filename2. Presumably, mahapick already did the job of setting matchnum based on scorevar.
The value specified under nmatch(#) should be significantly less than that specified in nummatches(#) in mahapick; conversely, the nummatches value needs to be significantly greater than the nmatch value. nummatches(#) is the size of the available pool of match candidates for each treated case. Because there may be replication in these candidates, it needs to be larger than nmatch(#), so that every treated case can get at least that many unique matches. How much larger depends on the degree of replication among the close matches, which is dependent on the data as well as the choice of covariates used in mahapick. Some experimentation with the nummatches value may be needed to get it sufficiently large. Adjusting the covariate set may also be useful, as will be explained.
If you are not getting enough matches selected in filename2, it could mean either that nummatches in mahapick needs to be increased, or that the covariate set needs to be adjusted. Putting aside which of these is the issue, let's examine what happens when you have many possible control cases in filename1, but you are not getting very many matches showing up in filename2. In this situation, although there are many control cases potentially matchable to the treated cases, there are not enough unique cases (distinct values of idvar) among them; there are many duplicates. Another characterization of this situation is that the best nmatch(#) matches for the treated cases are concentrated among a too-small portion of the potential control pool - a portion significantly smaller than nmatch(#) * num_treated_cases / num_potential_control_cases.
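One rough way to see whether duplication is the problem is to compare the number of treated cases with the number of distinct control ids among the candidates. The sketch below assumes a file named match1, an id variable id0, and the default variable names; firstid is a variable invented for the example.

```stata
* Sketch only: match1, id0, _prime_id and _matchnum are assumed names
use match1, clear

* number of treated cases
count if _matchnum == 0
local ntreated = r(N)

* number of distinct control ids among the potential matches
egen byte firstid = tag(id0) if _matchnum > 0
count if firstid == 1
local ndistinct = r(N)

display "treated cases: `ntreated'   distinct candidate controls: `ndistinct'"
```

If the number of distinct candidates is below nmatch(#) times the number of treated cases, filling every slot is impossible regardless of the visiting order.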
It is worth paying attention to the matchnum values in filename2. These indicate the quality of the choice made in the selection process - in terms of the rank of the choice's closeness to the treated case. Suppose, for example, that a given treated case has 8 as its first (or only) matchnum. This means it got the 8th closest match, which means that by the time its match was selected, the first 7 best matches were already taken. If there is no match at all, that means that all the available potential matches (the best nummatches) were already taken. (It may have been destined to get, say, the 12th best match, but you only had 10 available.) So, if you are not getting enough matches, increasing nummatches may help. But there are reasons that point to adjusting the covariate set as well.
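A quick way to look at these ranks is to tabulate matchnum in filename2. This is a minimal sketch, assuming the output file is named match2 and the default variable name _matchnum is in use.

```stata
* Sketch only: match2 and _matchnum are assumed names
use match2, clear

* distribution of the selected ranks; the missing option also shows unfilled slots
tabulate _matchnum, missing
```

A distribution concentrated at 1 and 2 suggests little competition for controls; a long tail of high ranks suggests heavy overlap among the candidate sets.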
If you are not getting enough matches in filename2, or if you do get enough, but matchnum values are high, then this means that, among the potential control cases, a large portion are equally or nearly equally close to the treated cases in terms of the distance measures. It also may signify that many of the treated cases are close to each other, and thus "attract" many of the same control cases. Whether that is good or not is for you to decide. But it may be that your set of covariates could be adjusted so as to spread out the distance values. Of course, the choice of covariates needs to be guided by analytical considerations, but it is worth noting that having at least one continuous measure (with a wide variety of values) in the covariate set is helpful toward the goal of spreading out the distance measure. If, on the other hand, there are no continuous variables - just a few dichotomous variables - then the distance measure will range over a small set of values. That, in turn, makes it likely that many pairs of cases will have the exact same distance measure, and consequently, many treated cases may have the same best match (and second best, etc.) among the potential control cases. For example, if the covariate set contains just five dichotomous variables, there are at most 2^5 = 32 distinct covariate patterns, so the distance from any given treated case to its potential controls can take at most 32 distinct values.
In the situation where the matchnum value is high, such as 8 in the example given above, it may be worth examining the spread of the distance measures in filename1; this would be done for the first n potential matches for each given treated case, where n is the maximal chosen matchnum in filename2 for the given treated case - or for the first n matches among all potential matches, where n is the maximal matchnum in filename2 overall. If they are close together, then a "poor" choice of a match may not be so bad after all; it may have been nearly as good as the preferred ones. If, on the other hand, they are spread out, then the "poor" choices are truly less desirable.
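The overall version of this check (the first n matches among all potential matches) might look like the sketch below. It assumes files named match1 and match2, a score variable with the default name _score present in match1, and the default name _matchnum.

```stata
* Sketch only: match1, match2, _score and _matchnum are assumed names

* n = the highest rank that was actually selected
use match2, clear
quietly summarize _matchnum, meanonly
local n = r(max)

* spread of the distance measure over the first n ranks in the candidate file
use match1, clear
tabstat _score if inrange(_matchnum, 1, `n'), by(_matchnum) statistics(n min median max)
```

If the medians rise only slightly from rank 1 to rank n, a high matchnum in filename2 costs little in closeness.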
Conceivably, mahaselectunique could use a file written by mahascores, with the genfile and treated options, as well as name1 and name2. However, the file would need to be sorted on name1 and scorevar, and have matchnum generated (by name1 (scorevar): gen int matchnum=_n). You will also need the nostrict option in mahaselectunique, since such a file has no matchnum = 0 records.
After running mahaselectunique, you will be left with records from filename1 where matchnum > 0, but sorted in a random order.
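A sketch of that preparation is given below, as it might appear in a do-file. The file names scores1, scores1_ranked, and match2 are placeholders, as are the variable names name1, name2, and scorevar; it is an assumption here that name1 holds the treated ids and name2 the control ids. The exact form of the mahascores output should be checked before relying on this.

```stata
* Sketch only: all file and variable names are placeholders
use scores1, clear

* rank the candidates within each treated case by increasing distance
bysort name1 (scorevar): generate int matchnum = _n
save scores1_ranked, replace

* nostrict is needed because the file has no matchnum = 0 records
mahaselectunique, usefile(scores1_ranked) writefile(match2) idvar(name2) ///
    prime_id(name1) matchnum(matchnum) scorevar(scorevar) nostrict
```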
Example, including call to mahapick
. mahapick income age numkids, idvar(id0) genfile(match1) nummatches(12) treated(assisted)
. mahaselectunique, idvar(id0) usefile(match1) writefile(match2) seed(14716)
Acknowledgement
The author wishes to thank Paula Arce for the inspiration to write this program.
Author
David Kantor. Email kantor.d@att.net if you observe any problems.
Also See
mahapick, mahascore, mahascores, mahascore2, covariancemat, variancemat, screenmatches, stackids.