{smcl}
{* 2012oct17}
{hline}
help for {hi:mahaselectunique}
{hline}

{title:Given a file written by mahapick with the genfile option, select a set of unique matches for the treated set.}

{p 8 17 2}
{cmd:mahaselectunique,}
{cmd:usefile(}{it:filename1}{cmd:)}
{cmd:writefile(}{it:filename2}{cmd:)}
{cmd:idvar(}{it:idvar}{cmd:)}
[
{cmd:prime_id(}{it:prime_id}{cmd:)}
{cmd:matchnum(}{it:matchnum}{cmd:)}
{cmd:scorevar(}{it:scorevar}{cmd:)}
{cmd:nmatch(}{it:#}{cmd:)}
{cmd:seed(}{it:seedvalue}{cmd:)}
{cmd:replace}
{cmd:clear}
{cmd:nostrict}
]


{title:Description}

{p 4 4 2}
{cmd:mahaselectunique} takes a file in the form generated by the {cmd:genfile} option of {help mahapick}, and
performs a unique selection of matches for the treated set using a randomized priority order. The resulting selections are written
to a separate file.

{p 4 4 2}
The file to be read, {it:filename1}, should be in the form written by {help mahapick} with
the {cmd:genfile()} option. That is, it should have these variables:{p_end}
{p 8 6 2} a "prime_id" variable that identifies treated cases;{p_end}
{p 8 6 2} an "id" variable that identifies control cases matched to the treated cases;{p_end}
{p 8 6 2} a "matchnum" variable, assumed to be int, enumerating the matches in closeness order;{p_end}
{p 8 6 2} optionally, a "score" variable (float or double) holding the distance measure.{p_end}

{p 4 4 2}
In {it:filename1}, every treated case should be present at least once, with {it:matchnum} = 0 (and {it:idvar} = {it:prime_id}, though this
latter condition is not mandatory). This record is not a proposed match; it
just carries the information that the {it:prime_id} value is a treated case. (But see the {cmd:nostrict} option for more on this matter.)
Additionally, there are zero or more cases for the
potential matches, with {it:matchnum} = 1, 2, 3, etc., corresponding to the best match, the second best match, the third best match,
and so on. These potential matches are unique within each treated case, but not necessarily unique across different treated cases.
The purpose of {cmd:mahaselectunique} is to select among the potential matches such that they are unique across all treated cases.
Furthermore, if more than one match per treated case is requested ({cmd:nmatch(}{it:#}{cmd:)} with {it:#}>1), the selected matches are unique across the entire set.

{title:Required Options}

{p 4 4 2}
{cmd:usefile(}{it:filename1}{cmd:)} specifies the file to read, as described above.

{p 4 4 2}
{cmd:writefile(}{it:filename2}{cmd:)} specifies the file to be written; it will have the same structure as {it:filename1}, but without
the {it:matchnum} = 0 cases. It will have exactly {it:#} cases for each value of {it:prime_id}, though some may have missing {it:idvar}
and {it:matchnum} values.

{p 4 4 2}
{cmd:idvar(}{it:idvar}{cmd:)} specifies the name of the id variable, which holds the identifier values of the matched
control cases in the two files.

{title:Optional Options}

{p 4 4 2}
{cmd:prime_id(}{it:prime_id}{cmd:)} allows you to specify the name for
the prime_id variable, which holds the identifier values of the treated cases in the two files.
The default name is _prime_id.

{p 4 4 2}
{cmd:matchnum(}{it:matchnum}{cmd:)}
allows you to specify the name for
the matchnum variable, which enumerates the matched control cases, presumably in increasing order of
the distance measure value. The default name is _matchnum.

{p 4 4 2}
{cmd:scorevar(}{it:scorevar}{cmd:)} allows you to specify the name of
the score variable, containing the distance measure. The default name is _score. See additional notes under
{hi:Remarks}.

{col 12}{hline}
{p 12 12 12}
{hi:note:} The four variables specified by {cmd:idvar(}{it:idvar}{cmd:)}, {cmd:prime_id(}{it:prime_id}{cmd:)},
{cmd:matchnum(}{it:matchnum}{cmd:)}, and {cmd:scorevar(}{it:scorevar}{cmd:)} will be sought in
{it:filename1} and written into {it:filename2}. They correspond to the same-named
options in {cmd:mahapick}, assuming that {it:filename1} was written by {cmd:mahapick}.

{p 12 12 12}
The variable named in {cmd:scorevar(}{it:scorevar}{cmd:)} is optional in {it:filename1}; if it is absent, just omit
this option. If you do not specify this option and there is a variable named _score in {it:filename1},
it will automatically be carried into {it:filename2}.
{p_end}

{p 12 12 12}
The variable names specified by {cmd:prime_id(}{it:prime_id}{cmd:)},
{cmd:matchnum(}{it:matchnum}{cmd:)}, and {cmd:scorevar(}{it:scorevar}{cmd:)} have the same default values as in
{cmd:mahapick}. Thus, you need to specify them only if you specified them in {cmd:mahapick} with non-default values.{p_end}
{col 12}{hline}
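
{p 4 4 2}
As an illustration only, suppose {it:filename1} was written with non-default variable names. The names {cmd:tid}, {cmd:mnum}, and {cmd:dist}
below are hypothetical, as are the file names and {cmd:id0}. The same names would then be passed to {cmd:mahaselectunique}:{p_end}

{p 4 8 2}
{cmd:. * hypothetical variable names; use whatever names were given to mahapick}{p_end}
{p 4 8 2}
{cmd:. mahaselectunique, idvar(id0) usefile(match1) writefile(match2)}
{cmd:prime_id(tid) matchnum(mnum) scorevar(dist)}{p_end}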

{p 4 4 2}
{cmd:nmatch(}{it:#}{cmd:)} specifies how many matches per treated case to gather. The default is 1.
See additional notes under {hi:Remarks}.

{p 4 4 2}
{cmd:seed(}{it:seedvalue}{cmd:)} specifies a seed value for the random-selection process. This can be useful if you want to
be able to replicate the process exactly; it can also be included in the documentation of your data-preparation processes.
The value provided can be anything that is acceptable to the {help set seed} command. This includes integers in the range of
1..(2^32-1), and "code" values {c -} those returned by {cmd:c(seed)}, representing the internal state of the
random number generator. The latter appear to be strings of the
form "X" (upper case X) followed by 36 hex digits, though possibly not all such strings are accepted.

{p 4 4 2}
{cmd:replace} specifies that if {it:filename2} already exists, it will be overwritten.

{p 4 4 2}
{cmd:clear} specifies that it is okay to replace the data in memory, even though the current data
have not been saved to disk. This program will
{help use} {it:filename1}, so the in-memory data need to be either empty ({help clear:cleared}) or "safe" to replace ({help save:saved} to disk); otherwise,
you will need to specify {cmd:clear}.

{p 4 4 2}
{cmd:nostrict} relaxes the requirement that every treated case be present in {it:filename1} with one record having {it:matchnum}=0.
By default, the set of treated
cases is taken from the {it:prime_id} values in records having {it:matchnum}==0. If there are any
treated cases with only the {it:matchnum}>0 cases present, then such cases will be omitted (but you will be shown a list of their
{it:prime_id} values). The {cmd:nostrict} option relaxes this behavior; the set of treated cases will then be
the set of all {it:prime_id} values present in {it:filename1}.
If {it:filename1} was produced by {cmd:mahapick}, then this option should not be needed. 
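
{p 4 4 2}
To see whether {cmd:nostrict} would matter for a given file, one possible check is to list the treated cases that have no
{it:matchnum}=0 record. This is only a sketch: it assumes the default variable names {cmd:_prime_id} and {cmd:_matchnum} and
the file {cmd:match1} from the example below, and {cmd:has0} is a scratch variable introduced here, not part of the package.{p_end}

{p 4 8 2}
{cmd:. use match1, clear}{p_end}
{p 4 8 2}
{cmd:. * has0 is 1 if the treated case has at least one matchnum==0 record, else 0}{p_end}
{p 4 8 2}
{cmd:. bysort _prime_id: egen byte has0 = max(_matchnum == 0)}{p_end}
{p 4 8 2}
{cmd:. * treated cases that would be omitted unless nostrict is specified}{p_end}
{p 4 8 2}
{cmd:. tabulate _prime_id if !has0}{p_end}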

{title:Remarks}

{p 4 4 2}
See {help mahapick} for an explanation of the distance measure, the matching process, and the concept of the treated cases.
{cmd:mahapick} will obtain a set of potential match candidates for each of the treated cases, but they may not be unique.
(They should be unique for any given treated case, but not necessarily unique across different treated cases.)
{cmd:mahaselectunique} will attempt to reduce that to a set of unique matches.

{p 4 4 2}
The selection process begins by marking all potential matches as available. The treated cases are visited in a random order.
For each treated case, the best available 
matching control case is chosen, and all other potential matches (for all treated cases) with the same value of {it:idvar} are
marked as unavailable. (The "best available matching control case" is the one with the lowest {it:matchnum} value among those
that are available.)
If {cmd:nmatch(}{it:#}{cmd:)} is specified as greater than 1,
then additional passes through the treated cases are performed; each pass through this process is given a different random order.
(Thus, the second and subsequent selections for a given treated case are not
made in the same pass; to do so would give the early-chosen treated cases an unfair advantage on all of their selections.)

{p 4 4 2}
{it:filename2} will not be sorted as you might expect. It is written in the order that selections were made, that is, randomly. But if you 
sort on {it:prime_id} and {it:matchnum}, it will make more sense.
It may have cases with {it:matchnum}=-1, signifying that {it:filename1} had no potential matches for the given value of
{it:prime_id}; it may also have cases with {it:matchnum}=., signifying that no more potential matches were available.
(Note that {it:filename1} should have no instances of {it:matchnum} being negative or missing.)
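
{p 4 4 2}
A minimal way to inspect the result, assuming the default variable names and the file name {cmd:match2} from the example below:{p_end}

{p 4 8 2}
{cmd:. use match2, clear}{p_end}
{p 4 8 2}
{cmd:. sort _prime_id _matchnum}{p_end}
{p 4 8 2}
{cmd:. * the missing option also counts cases where _matchnum is .}{p_end}
{p 4 8 2}
{cmd:. tabulate _matchnum, missing}{p_end}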

{p 4 4 2}
If {it:scorevar} is included, it is not used in determining the closeness of the match; it is only carried along into {it:filename2}.
Presumably, {cmd:mahapick} already did the job of setting {it:matchnum} based on {it:scorevar}.

{p 4 4 2}
The value specified in {cmd:nmatch(}{it:#}{cmd:)} should be significantly less than the value specified in {cmd:nummatches(}{it:#}{cmd:)} in
{cmd:mahapick}. {cmd:nummatches(}{it:#}{cmd:)} is the size of the pool of match candidates for each treated case. Because the same control case
may appear as a candidate for several treated cases, the pool needs to be larger than {cmd:nmatch(}{it:#}{cmd:)}, so that every treated case can get at least that many
unique matches. How much larger depends on the degree of duplication among the close matches, which depends on the data as well as
the choice of covariates used in {cmd:mahapick}. Some experimentation with the {cmd:nummatches} value may be needed to make it sufficiently
large. Adjusting the covariate set may also be useful, as explained below.

{p 4 4 2}
If you are not getting enough matches selected in {it:filename2}, it could mean either that {cmd:nummatches} in {cmd:mahapick} needs to be
increased, or that the covariate set needs to be adjusted. Putting aside which of these is the issue, let's examine what happens
when you have many possible control cases in {it:filename1}, but few matches show up in {it:filename2}.
In this situation, although there are many control cases potentially matchable to the treated cases, there are not enough
unique cases (distinct values of {it:idvar}) among them; there are many duplicates. Another characterization of this situation is that
the best {cmd:nmatch(}{it:#}{cmd:)}
matches for the treated cases are concentrated in too small a portion of the potential control pool {c -} a portion significantly smaller than
{it:#} * num_treated_cases / num_potential_control_cases, where {it:#} is the {cmd:nmatch} value.
For example, with {cmd:nmatch(2)}, 100 treated cases, and 1,000 potential control cases, a full selection needs 200 distinct control cases, that is, 20% of the pool.
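
{p 4 4 2}
One rough way to gauge whether the candidates are too concentrated is to compare the number of distinct control ids among the candidates
with the number of treated cases times {it:#}. The following is a sketch under stated assumptions: default variable names,
{cmd:idvar(id0)}, and the file {cmd:match1} from the example below. {cmd:firsttrt} and {cmd:firstctl} are scratch variables introduced for this illustration.{p_end}

{p 4 8 2}
{cmd:. use match1, clear}{p_end}
{p 4 8 2}
{cmd:. * number of distinct treated cases}{p_end}
{p 4 8 2}
{cmd:. egen byte firsttrt = tag(_prime_id)}{p_end}
{p 4 8 2}
{cmd:. count if firsttrt}{p_end}
{p 4 8 2}
{cmd:. * keep candidates only, then count distinct control ids}{p_end}
{p 4 8 2}
{cmd:. keep if _matchnum > 0}{p_end}
{p 4 8 2}
{cmd:. egen byte firstctl = tag(id0)}{p_end}
{p 4 8 2}
{cmd:. count if firstctl}{p_end}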

{p 4 4 2}
It is worth paying attention to the {it:matchnum} values in {it:filename2}. These indicate the quality of the choice made in the
selection process {c -} in terms of the rank of the choice's closeness to the treated case.
Suppose, for example, that a given treated case has 8 as its first (or only) {it:matchnum}. This means it got
the 8th closest match, which means that by the time its match was selected, the 7 better matches were already taken. If there is no match at all, that
means that all the available potential matches (the best {cmd:nummatches}) were already taken. (It may have been destined to get, say, the
12th best match, but only 10 were available.) So, if you are not getting enough matches, increasing {cmd:nummatches} may help. But there are
reasons that point to adjusting the covariate set as well.

{p 4 4 2}
If you are not getting enough matches in {it:filename2}, or if you do get enough, but {it:matchnum} values are high, then this
means that, among the potential control cases, a large portion are equally or nearly equally close to the treated cases in
terms of the distance measures. It also may signify that many of the treated cases are close to each other, and thus "attract"
many of the same control cases.
Whether that is good or not is for you to decide. But it may be that your set of covariates could be
adjusted so as to spread out the distance values. Of course, the choice of covariates needs to be guided by analytical considerations,
but it is worth noting that having at least one continuous measure (with a wide variety of
values) in the covariate set helps to spread out the distance measure. If, on the other hand, there are no
continuous variables {c -} just a few dichotomous variables {c -} then the distance measure will range over a small set of values.
That, in turn, makes it likely that many pairs of cases will have exactly the same distance measure, and consequently,
many treated cases may have the same best match (and second best, etc.) among the potential control cases.
For example, if the covariate set contains just five dichotomous variables, there are only 2^5 = 32 possible covariate patterns, so for any given treated case the
distance measure will have at most 32 distinct values.

{p 4 4 2}
In the situation where the {it:matchnum} value is high, such as 8 in the example given above, it may be worth examining the
spread of the distance measures in {it:filename1}; this would be done for the first {it:n} potential matches for each given treated case,
where {it:n} is the maximal chosen {it:matchnum} in {it:filename2} for the given treated case {c -} or for the first {it:n} matches
among all potential matches, where {it:n} is the maximal {it:matchnum} in {it:filename2} overall.
If they are close together, then a "poor" choice of a match may not be so bad after all; it
may have been nearly as good as the preferred ones. If, on the other hand, they are spread out, then the "poor" choices are
truly less desirable.
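
{p 4 4 2}
As a sketch of such an examination, take {it:n}=8 as in the example above. This assumes that {it:filename1} is the file {cmd:match1} from the example below,
that it contains a score variable under the default name {cmd:_score}, and that {it:matchnum} has its default name:{p_end}

{p 4 8 2}
{cmd:. use match1, clear}{p_end}
{p 4 8 2}
{cmd:. * 8 is illustrative; substitute the maximal matchnum found in filename2}{p_end}
{p 4 8 2}
{cmd:. summarize _score if inrange(_matchnum, 1, 8), detail}{p_end}
{p 4 8 2}
{cmd:. * spread of the distance measure at each rank}{p_end}
{p 4 8 2}
{cmd:. tabstat _score if inrange(_matchnum, 1, 8), by(_matchnum) statistics(min median max)}{p_end}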

{p 4 4 2}
Conceivably, {cmd:mahaselectunique} could use a file written by {cmd:mahascores} with the {cmd:genfile} and {cmd:treated} options, as well as
{cmd:name1} and {cmd:name2}. However, the file would need to be sorted on {it:name1} and {it:scorevar}, and have {it:matchnum}
generated. ({cmd:by} {it:name1} {cmd:(}{it:scorevar}{cmd:): gen int} {it:matchnum}{cmd:=_n}.) You would also need the
{cmd:nostrict} option in {cmd:mahaselectunique}.

{p 4 4 2}
After running {cmd:mahaselectunique}, you will be left with records from {it:filename1} where {it:matchnum} >0, but sorted in a random
order.

{title:Example, including call to mahapick}

{p 4 8 2}
{cmd:. mahapick income age numkids, idvar(id0) genfile(match1)}
{cmd:nummatches(12) treated(assisted)}{p_end}
{p 4 8 2}
{cmd:. mahaselectunique, idvar(id0) usefile(match1) writefile(match2)}
{cmd:seed(14716)}{p_end}
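
{p 4 4 2}
A variation on the second command, shown only as a sketch using the options documented above: request three unique matches per treated case
and overwrite an existing {cmd:match2}. Whether the 12 candidates per treated case from the {cmd:mahapick} call are enough to yield three unique matches
for every treated case depends on the data.{p_end}

{p 4 8 2}
{cmd:. * illustrative values; nmatch(3) should stay well below nummatches(12)}{p_end}
{p 4 8 2}
{cmd:. mahaselectunique, idvar(id0) usefile(match1) writefile(match2)}
{cmd:nmatch(3) seed(14716) replace}{p_end}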


{title:Acknowledgement}
{p 4 4 2}
The author wishes to thank Paula Arce for the inspiration to write this program.

{title:Author}
{p 4 4 2}
David Kantor. Email {browse "mailto:kantor.d@att.net":kantor.d@att.net} if you observe any
problems.

{title:Also See}
{p 4 4 2}
{help mahapick}, {help mahascore}, {help mahascores}, {help mahascore2}, {help covariancemat}, {help variancemat},
{help screenmatches}, {help stackids}.
