{smcl}
{* 19-Apr2004; rev 8Feb2006, 1apr2008, 2012feb9}
{hline}
help for {hi:screenmatches}
{hline}

{title:Screen matching observations as prepared by {help mahapick}}

{p 8 17 2}
{cmd:screenmatches}
{it:varlist} [{cmd:if} {it:exp}]
, {cmd:gen(}{it:newvar}{cmd:)}
{cmd:nummatches(}{it:#}{cmd:)}
{cmd:matchnum(}{it:matchnumvar}{cmd:)}
{cmd:prime_id(}{it:prime_id_var}{cmd:)}
[ {cmdab:v:erbose} {cmd:summ tab}]

{title:Description}

{p 4 4 2}
{cmd:screenmatches} provides a way of screening a dataset consisting of
"treated" observations and their matched "control" observations.
We seek to screen the control observations; treated observations are usually
kept (except as affected by an {cmd:if} condition; but see notes about excluded
treated observations).
Presumably, this dataset has been created by {help mahapick} and is stored in
long form as produced by the {cmd:genfile} option of {help mahapick} {c -} or by
the use of the {cmd:pickids} option of {help mahapick}, followed by an
invocation of {help stackids} or an equivalent procedure.
You could, of course, create the dataset some other way, as long as the
required structure is present.
Essentially, the data should be in long form, such that the observations can be
partitioned into subsets, with each subset corresponding to a treated
observation.
Each such subset includes the treated observation itself, plus several matching
control observations.

{p 4 4 2}
In addition to the data structure alluded to above, it is presumed that some
"content" data have been {help merge}d onto the given dataset.
It is these content variables that we are screening on; these variables, or a
subset thereof, are named in {it:varlist}.
Typically, these content variables would include some items not found in the
match covariates used in forming the match (in {cmd:mahapick}).

{p 4 4 2}
The purpose of this is to allow you to make a flexible dataset that can
accommodate adjustments to the set of analyzed variables, or to the number of
controls per treated observation.

{p 4 4 2}
Note that the present description speaks of "screening" or "keeping"
observations.
This actually occurs in terms of a generated indicator variable that
represents the results of this screening.
No observations are actually dropped.

{title:Options}

{p 4 4 2}
{cmd:gen(}{it:newvar}{cmd:)} is required.
This specifies the new variable to be generated, containing 0's and 1's,
indicating which observations are screened in.

{p 4 4 2}
{cmd:nummatches(}{it:#}{cmd:)} is required.
This specifies the number of matches per treated observation that you want to
choose.

{p 4 4 2}
{cmd:prime_id(}{it:prime_id_var}{cmd:)} is required.
This specifies a variable whose distinct values correspond to the treated
observations, thereby enabling the partitioning of the dataset into subsets
that correspond to the treated observations.

{p 4 4 2}
Typically, this would be the same variable as was specified in the
{cmd:prime_id()} option (default name {cmd:_prime_id}) in {cmd:mahapick}, or
the {cmd:idprimevar()} option in {cmd:stackids} {c -} though, all that really
matters is that its distinct values correspond to the treated observations.

{p 4 4 2}
{cmd:matchnum(}{it:matchnumvar}{cmd:)} is required.
This specifies a variable that enumerates the observations within each of the
partition subsets that correspond to the treated observations.
Typically, this would be the same variable as was specified in the
{cmd:matchnum()} option (default name {cmd:_matchnum}) in {cmd:mahapick}, or
the {cmd:matchnumvar()} option in {cmd:stackids}.

{p 4 4 2}
{it:matchnumvar} should be 0 for the treated observations, and should have
positive values for the control observations, with the lower values
corresponding to the best matches.
Typically, 1 corresponds to the best match, 2 to the next best match, and so
on, though the exact enumeration is not critical.

{p 4 4 2}
{cmdab:v:erbose} specifies that {cmd:screenmatches} will report the number of
treated and control observations that were selected.

{p 4 4 2}
{cmd:summ} specifies that the min and max of {it:matchnumvar} will be
summarized.

{p 4 4 2}
{cmd:tab} specifies that the min and max of {it:matchnumvar} will be
tabulated.
This only applies if {cmd:summ} was specified.

{title:Remarks}

{p 4 4 2}
To help understand the purpose of this, suppose that in doing the matching
process, you had a certain set of variables in mind for analysis.
You are clever enough to notice that some of these variables have missing
values in a few of the observations.
So, in the matching process (prior to your call to {help mahapick}), you
screened out these troublesome observations.
Thus, the matching process yielded, not the best matches in general, but the
best matches under the constraint that none of the targeted variables have
missing values.

{p 4 4 2}
(Note that this assumes that some of the variables in question are not in the
match covariates; missing values in match covariates will disqualify
observations from being matched.
See {help mahapick} for more on this.)

{p 4 4 2}
Your analysis will proceed correctly under the given scenario.
But now suppose that you want to add some variables to the analysis, or remove
some from it.
Properly, to get the best matches for the revised analysis, you would be
compelled to rerun the matching process, changing the pre-screening of
observations to correspond to the revised set of variables.
Alternatively, you can omit doing this, but then you are accepting an analysis
set that is not optimal for the given analysis.
It may...

{p 8 8 2}
have fewer control observations than what is actually possible (and you may
also have a non-constant number of controls per treated observation), or,

{p 8 8 2}
overlook some control observations that are appropriate matches (and are
better matches than the ones you have in their places).

{p 4 4 2}
Another scenario is that you might decide that, instead of, say, two controls
per treated observation, you now want three.
If your matching process collected only two, you would now be compelled to
rerun it.

{p 4 4 2}
{cmd:screenmatches} allows you to avoid these problems, so you don't have to
rerun your matches (or accept a suboptimal control set) every time you
reformulate the analysis.
This is done in combination with how you use {help mahapick}.
In using {help mahapick}, you should {it:not} do any of the prescreening as
described above.
(Of course, it may be appropriate to do some screening, based on conditions
that you are certain are universally applicable.){bind:  }Also, you should call
{help mahapick} with a {cmd:nummatches(}{it:#}{cmd:)} value that is
considerably larger than the number of controls per treated that you actually
want to analyze.
A good choice might be for this {it:#} to be two or three times the number of
controls per treated in your foreseeable analyses.
This provides a large reservoir of possible matches, which can be thought of
as a set of queues, one for each treated observation, with the best matches
positioned at the head of each queue.

{p 4 4 2}
After you have collected the matches and have {help merge}d the content data
onto them, you are ready for some sort of analysis.
This is where you apply {cmd:screenmatches}, telling it what variables are to
be considered, and how many controls per treated observation you wish to use.

{p 4 4 2}
Note that in this process, you provide {help mahapick} with one
{cmd:nummatches(}{it:#}{cmd:)} value; then you provide {cmd:screenmatches}
with another, lower, {cmd:nummatches(}{it:#}{cmd:)} value.
It is hoped that users will not be confused by the fact that these options in
the two programs have the same name.
You are {it:not} supposed to supply the same value to them.

{p 4 4 2}
If you have collected enough potential matches in the {cmd:mahapick} step,
then you will be able to screen-in a constant number of controls per treated
observation in the {cmd:screenmatches} step.
That is, if in {cmd:mahapick}, you specify {cmd:nummatches()} sufficiently
large (and the potential pool was large enough to begin with), then, in
{cmd:screenmatches}, if you specify, for example, {cmd:nummatches(3)}, you
will get three controls for each treated case {c -} rather than having fewer
than three for some of the treated cases.

{p 4 4 2}
Users should be aware that there is a disadvantage to using this method.
Some may find it awkward to have several variant analyses on the same set of
treated cases, where the control set might vary slightly in its composition
from one analysis to another.

{p 4 4 2}
Even if you don't end up in a situation that would compel you to rerun the
matching, it should be understood that it may be valuable to use
{cmd:screenmatches} along with the procedure described above.
This is because the matching process may be relatively time-consuming.
Thus, it is useful to be able to run it once (or a very few times), setting up
the resulting dataset so that many different analyses can subsequently be run.
Typically, you would want to do the matching and the analyses in separate
steps, saving intermediary files in between the steps.
You would only want to rerun the matching to accommodate a change in the
covariate set or {cmd:matchon()} options, or if the data were revised.

{p 4 4 2}
Important: if you specify an {cmd:if} condition which causes the exclusion of
some treated observations, then all corresponding control observations are
also excluded.

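
{p 4 4 2}
To check whether the screening did yield a constant number of controls per
treated observation, as discussed above, something along the following lines
may help.
This is only a sketch under assumptions: {cmd:k}, {cmd:matchno}, and
{cmd:id_prime} are the illustrative variable names used in the Examples
section, and {cmd:nctl} is a hypothetical new variable introduced here.
The first command counts the screened-in controls within each subset; the
second tabulates that count over the screened-in treated observations.

{p 4 8 2}
{cmd:. bysort id_prime: egen nctl = total(k == 1 & matchno > 0)}
{p_end}

{p 4 8 2}
{cmd:. tabulate nctl if matchno == 0 & k == 1}
{p_end}

{p 4 4 2}
If enough potential matches were collected, one would expect the tabulation to
show a single value of {cmd:nctl}, equal to the {cmd:nummatches(}{it:#}{cmd:)}
value given to {cmd:screenmatches}; smaller values would point to treated
observations whose queue of potential matches ran short.
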
{title:Examples}

{p 4 4 2}
These examples assume that the matching has already been done (as by
{cmd:mahapick}, with a relatively high value for
{cmd:nummatches(}{it:#}{cmd:)}), and "content" data have been merged on.

{p 4 8 2}
{cmd:.local regvars "worked_prior workhours_prior income_prior age agesq dsb married spouse_income_prior hsgrad"}
{p_end}

{p 4 8 2}
{cmd:.screenmatches `regvars', nummatches(3) gen(k) matchnum(matchno) prime_id(id_prime)}
{p_end}

{p 4 8 2}
{cmd:.regress income_current assisted `regvars' if k}
{p_end}

{p 4 4 2}
Be sure to include any conditions on both the {cmd:screenmatches} and the
analysis command:

{p 4 8 2}
{cmd:.screenmatches `regvars' if female, nummatches(3) gen(k) matchnum(matchno) prime_id(id_prime)}
{p_end}

{p 4 8 2}
{cmd:.regress income_current assisted `regvars' if female & k}
{p_end}

{title:Author}

{p 4 4 2}
David Kantor; initial development was done at The Institute for Policy
Studies, Johns Hopkins University.
Email {browse "mailto:kantor.d@att.net":kantor.d@att.net} if you observe any
problems.

{title:Also See}

{p 4 4 2}
{help mahapick}, {help mahascore}, {help mahascores}, {help mahascore2},
{help covariancemat}, {help variancemat}, {help stackids}.