/* ado\mahapick.ado David Kantor, kantor.d@att.net (formerly dkantor@jhu.edu) Initially developed at The Institute for Policy Studies, Johns Hopkins University Began 4-4-2003 This is to perform a mahalanobis selection. See mahascore.ado for the score generation. */ *! version 1.2.1 2012feb8 /* prior 1.2.0 3-31-2008 (fix comments, 4-2-2008) Prior 1.1.12 1-24-2006 */ /* Revisions 5-9-2003: Making the variance to be calculated on the treated set -- not on the whole set. There are corresponding changes to mahascore and variancemat. 9-9-2003: multiple picks. We do stable sorts, so as to have consistent results whenever there are ambiguous choices. You may want to do a pre-sort in order to make this truly consistent. 9-24-2003: accomodate string id vars. (Some additional code needed in the use of the -clear- option. Otherwise, this already allowed for string id vars.) 11-7-2003: remove the treated() option from the call to mahascore; accomodates a change to mahascore. 11-21-2003: adding the -common- option [removed 3-26-2008 -- but in the new scheme, it is always "common", based on how covariancemat works]; simply passes this on to variancemat. The effect is to limit the variace calculations to those cases that are nonmissing on all the covariates (varlist). Displaying the varmat; cancelling the use of disp_var. 12-3-2003: Implementing -omitmiszer- option to drop vars from varlist if variance is missing or zero. [removed 3-26-2008] Also, some improvements in the reporting of failures by mahapicksome. Also, in mahapicksome, if there are no non-treated non-ref obs, it used to cause an error; now it just continues, with no matches assigned. Also, putting in a preserve/restore that applies to sliceby. A -fast- option goes with that. 12-11-2003: putting in the option to generate a new file (using -post-) in long form; the option is -genfile-. This is an alternative to -pickids-. Note that pickids makes a wide data set of match ids. Consequently, pickids now becomes optional. NEW options: genfile replace prime_id matchnum nummatches full 1-6-2004: ?? 2-15-2005: using idvarmisval where -clear- is implemented; moving the setting of idvarmisval to an earlier point. Also, making it mandatory that pickids components have the same type as idvar. Also, making clear valid only if pickids is specified. 2-16-2005: continuing. 2-17-2005: correcting the detection of multiple score values at end of mahapicksome. 4-25-2005 & 4-26-2005: adding the -score-, -scorevar-, and -all- options. Also, switching the order of vars in the genfile post; putting prime_id first. (Has no substantive effect.) 5-23-2005 ?? 1-24-2006: just fixing some comments. 3-26-2008: use a new version of mahascore. It generates a true Mahalanobis measure -- not the normalized Euclidean, as was done previously. (But see euclidean option.) Call covariancemat instead of variancemat. Cancel the common option; it is now always "common". Cancel the omitmiszer option. Put in a euclidean option -- equivalent to old behavior. Added weights -- for the covariance computation. Added option unsquared. Added option float. (But note that `float' is NOT passed to mahascore, as it is not necessary to make that limitation there, and it may enable more ties. Instead, just let the score variable (in genfile) be float; that's where it has a real effect.) Using a tempname for the genfile handle; improves break behavior. Fixed bug in under-reporting of equal scores. 2012feb8: enhanced handling of errors from covariancemat. */ prog def mahapick version 8.2 #delimit ; syntax varlist(numeric) [aw fw iw pw], idvar(varname) treated(varname numeric) [ pickids(varlist) genfile(string) prime_id(name) replace matchnum(name) scorevar(name) nummatches(integer 1) full matchon(varlist) sliceby(varlist) clear fast score all UNSQuared EUCLidean float DISPlay(string) noCOVTRLIMitation ] ; #delimit cr /* For each treated item, form the score and then pick the lowest scored untreated case. This is matching with replacement. Note that the score is distinct for each treated case. (You _could_ have a var for each, but that would make for MANY vars. Actually, it might not be a bad thing, but I am not doing it for now. P.S., under the genfile and score options, you can save the scores in LONG form. And with -all-, you can get all possible matches and the scores ) The -scorevar- option is what to name the score variable in the -score- suboption of -genfile-. It is not exactly the same as the scorevar option to mahapicksome (which is passed as a tempvar), though they are effectively the same entity. See mahascore.ado We expect that the pickids are of the same type as idvar. Originally pickid was passthru, but now (4-7-2003) I want to check to see that it starts out as missing -- or clear it. Adding the clear option. 4-14-2003: sliceby is implemented. Note that the sorting seems to be the big time user-upper. If you could partition the set by some subset of the matchon values (but use the already-made scores) then this might speed-up considerably. This is what sliceby does. */ capture assert `treated'==0 | `treated'==1 if _rc ~=0 { disp as error "treated must be valued {0, 1}" exit 198 } if "`covtrlimitation'" =="" { local iftreated_for_cov "if `treated'" } if "`pickids'" == "" & "`genfile'" == "" { disp as error "You must use -genfile- or -pickids- (or both)." exit 198 } if "`pickids'" == "" & "`clear'" ~= "" { disp as error "-clear- not valid without -pickids-" exit 198 } local typ1: type `idvar' if substr("`typ1'",1,3) == "str" { local idvarmisval `""""' } else { local idvarmisval "." } if "`pickids'" ~= "" { foreach var of local pickids { local typ2: type `var' capture assert "`typ2'" == "`typ1'" if _rc { disp as error "var `var' of pickids is not the same type as idvar `idvar'" exit 198 } if "`clear'" == "" { // -clear- was NOT specified capture assert mi(`var') if _rc ~= 0 { disp as error "pickid (`var') must be all missing to start, or use the " as input "clear" as error " option." exit 198 } } else { // -clear- WAS specified qui replace `var' = `idvarmisval' } } } if "`genfile'" ~= "" { if "`prime_id'" == "" { local prime_id "_prime_id" } if "`matchnum'" == "" { local matchnum "_matchnum" } if "`scorevar'" == "" { local scorevar "_score" } /* We could assure that the original `scorevar' is specified only if -score- is specified. Let's not get that picky. */ assert_distinct prime_id matchnum `prime_id' `matchnum' assert_distinct prime_id scorevar `prime_id' `scorevar' assert_distinct matchnum scorevar `matchnum' `scorevar' if "`replace'" == "" { confirm new file `genfile' } else { capture confirm new file `genfile' local rc1 = _rc capture confirm file `genfile' local rc2 = _rc if `rc1' & `rc2' { disp as error "filespec for genfile invalid: `genfile'" exit 198 } } if "`score'" ~= "" { local scoretyp "double" if "`float'" ~= "" { local scoretyp "float" } local score_elt "`scoretyp' `scorevar'" } tempname genfile_handle postfile `genfile_handle' `typ1' (`prime_id' `idvar') int `matchnum' `score_elt' using `genfile', `replace' disp as text "file `genfile' opened for posting" } else { // no genfile local x_without_genfile 0 if "`prime_id'" ~= "" { disp as error "prime_id invalid without genfile" local x_without_genfile 1 } if "`matchnum'" ~= "" { disp as error "matchnum invalid without genfile" local x_without_genfile 1 } if "`replace'" ~= "" { disp as error "replace invalid without genfile" local x_without_genfile 1 } if "`full'" ~= "" { disp as error "full invalid without genfile" local x_without_genfile 1 } if "`score'" ~= "" { disp as error "score invalid without genfile" local x_without_genfile 1 } if "`scorevar'" ~= "" { disp as error "scorevar invalid without genfile" local x_without_genfile 1 } if "`all'" ~= "" { disp as error "all invalid without genfile" local x_without_genfile 1 } if `nummatches' > 1 { /* We don't test for "`nummatches'" ~= "" because it is always present. That is because it is an optional integer, and thus must have a default value. This is a quirk of stata syntax. Thus, given this code, you can type nummatches(1) without genfile -- and get away with it. */ disp as error "nummatches invalid without genfile" local x_without_genfile 1 } if `x_without_genfile' { exit 198 } } quietly count if `treated' local Ntreated = r(N) disp as text "mahapick called; num treated: `Ntreated'" if "`sliceby'" ~="" { disp as text "sliceby = `sliceby'" } else { disp as text "sliceby not specified" } if index("`display'", "summ") { disp _new as text "summary of covars for the treated cases:" disp as text "number of treated cases: `Ntreated'" sum `varlist' [`weight' `exp'] if `treated', sep(0) } /* Get an inverse covariance matrix; adapt some code from mahascores. */ /* covariancemat is an ado by d.k. */ tempname cov invcov capture covariancemat `varlist' `iftreated_for_cov' [`weight' `exp'], covarmat(`cov') if _rc { disp as err "error from covariancemat" error _rc } if "`euclidean'" ~= "" { /* -euclidean- option means to zero the off-diagonal elements. It should be equivalent to the behavior of the old mahascore or mahapick. This needs to be done BEFORE the inverting. The resulting inverted matrix should be just the reciprocals in the diagonal. */ local crows = rowsof(`cov') // local ccols = colsof(`cov') -- should equal crows forvalues j = 2 / `crows' { forvalues k = 1 / `=`j'-1' { mat `cov'[`j', `k'] = 0 mat `cov'[`k', `j'] = 0 } } } if index("`display'", "covar") { disp _new as text "covariance matrix:" mat list `cov' } /* disp "begin inv(cov)" */ mat `invcov' = inv(`cov') /* -- you may be able to do invsym(`cov') But there are pitfalls, if `cov' is not positive-definite. (It ought to be at least positive-semidefinite.) But the most general choice is inv(). */ /* disp "end inv(cov)" */ /* End of segment adapted from mahascores. */ /* The use of covariancemat speeds up mahascore. But also note that variancemat is called for the whole set of treated obs. This is done so as to get a covariance matrix for the whole set -- not for each slice separately (if `sliceby' is used), which is what you would get if this weren't here. */ if index("`display'", "invcov") { disp _new as text "invcovariance matrix:" mat list `invcov' } /* Form groups by `sliceby', if it exists. This is supposed to speed up the process, as the big time-waster is the sorting that occurs on each pick. sliceby, generally speaking, is like extra matchon vars. BUT to avoid confusion, we will require that it be a subset of matchon. (Generally, the actual set of vars on which we are constrained to match, is the sliceby _union_ matchon.) */ if "`sliceby'" ~="" { if "`matchon'" == "" { disp as err "You must specify matchon with sliceby" exit 198 } foreach var of local sliceby { if ~index("`matchon'", "`var'") { disp as error "sliceby must be a subset of matchon" exit 198 } } /* As I mentioned, it is not functionally necessary for sliceby to be a subset of matcho. I just think it makes for clearer syntax. */ tempvar group1 egen long `group1' = group(`sliceby'), mis // note: egen is sortpreserve sum `group1', meanonly local numslices = r(max) if "`fast'" == "" { preserve } tempfile base1 save `base1' /* Note: here's where it will help if the file is reduced to only vars that are necessary. */ disp as text "numslices = " as res "`numslices'" disp _new "---- sliceby has been specified" forvalues fileno = 1 / `numslices' { use if `group1' == `fileno' using `base1' quietly count local N1 "`r(N)'" quietly count if `treated' local Ntr "`r(N)'" /* Note: Ntr is the count of treated cases in the present slice; there is also Ntreated, the count of the treated cases in the full set. */ disp _new as text " processing slice " as res "`fileno'" as text "; N=" as res "`N1'" as text "; treated=" as res "`Ntr'" mahapick_from_slice `varlist', idvar(`idvar') treated(`treated') /// pickids(`pickids') /// postfilehandle(`genfile_handle') /// matchon(`matchon') invcovarmat(`invcov') /// genfile(`genfile') nummatches(`nummatches') `full' `score' `all' /// `unsquared' tempfile t_`fileno' save `t_`fileno'' } drop _all forvalues fileno = 1 / `numslices' { if `fileno' ==1 { use `t_`fileno'' } else { append using `t_`fileno'' } } if "`fast'" == "" { restore, not } } else { // sliceby was not specified. disp as text _new "---- sliceby was not specified" mahapick_from_slice `varlist', idvar(`idvar') treated(`treated') /// pickids(`pickids') /// postfilehandle(`genfile_handle') /// matchon(`matchon') invcovarmat(`invcov') /// genfile(`genfile') nummatches(`nummatches') `full' `score' `all' /// `unsquared' /* Note that this could have been programmed as a degenerate case of the slices -- with just one slice. It would have been simple code, and it is tempting to do that. But it would imply it being saved and re-used for no good reason. (But maybe that could be programmed around. Something to consider.) */ } if "`genfile'" ~= "" { postclose `genfile_handle' disp as text "file `genfile' closed" } end //mahapick prog def mahapick_from_slice /* Pick from the current slice of the file. */ version 8.2 #delimit ; syntax varlist, idvar(passthru) treated(varname) invcovarmat(passthru) [ pickids(passthru) matchon(passthru) genfile(passthru) nummatches(passthru) full all postfilehandle(passthru) score UNSQuared /* float */ ] ; #delimit cr tempvar not_treated gen byte `not_treated' = ~`treated' tempvar score1 sort `not_treated', stable quietly count if `treated' local Ntreated = r(N) forvalues n1 = 1 / `Ntreated' { mahascore `varlist', refobs(`n1') gen(`score1') `invcovarmat' /// `unsquared' /* `float' */ /* treated(`treated') */ // sum `score1' mahapicksome, `idvar' refobs(`n1') treated(`treated') scorevar(`score1') /// `matchon' `pickids' `genfile' `nummatches' `full' `score' `all' `postfilehandle' // mahapicksome is sortpreserve; so don't worry about the sort order. drop `score1' // disp "time: $S_TIME" } end // mahapick_from_slice prog def mahapicksome, sortpreserve /* This used to be mahapick1, but I am adapting it to pick several matches. */ version 8.2 #delimit ; syntax , idvar(varname) refobs(integer) treated(varname) scorevar(varname) [ pickids(varlist) matchon(varlist) genfile(string) nummatches(integer 1) full all postfilehandle(name) score ] ; #delimit cr /* Given a scoring, select an observation to match a given refobs. Later (9-9-2003), this was expanded to possibly multiple observations -- as many as there are variables in `pickids'. This does matching WITH REPLACEMENT. There is no guarantee that replicated selections won't be made. But with a large set to choose from, you generally don't get too much replication. If multiple equal-scored items are found, the first is picked. I have made all the sorting stable, so, if there are any such cases, the results will be consistent, provided that the initial sort order is consistent. This will sort; but will implicity restore order (sortpreserve). That sortpreserve is important. The calling program needs to have the same order after the return. If matchon is specified, then the matching is constrained to cases that match on these vars. Best if they are integer-valued. Note that the sorting seems to be the big time waster. Thus, this is improved when the calling program slices up the data into smaller pieces. If the score is missing on the selected item, then return missing; i.e., do not select this one; there is no basis for choosing it. This is a change as of 4-18-2003. Previously we did issue a warning, but also returned the selected case. 9-9-2003: Previously, this was rclass, and returned the result in r(pick); the calling program then set the pickid. But now, I am making this do the setting of pickid. To do this, we needed to set up a pickid option (required). One fault with the old system: the pickid must be numeric. I believe that now, this will work on any data type for the pickids. Part of this change is to place the refobs into position 1 in the sort. Thus, all pickid values are loaded into [1]. Making pickid a varlist rather than a varname; i.e., it can have several vars. Thus renaming it to pickids. Note that the vars in pickids should be all of the same type, which should be the same as that of idvar. */ disp "mahapicksome called; refobs= " as res "`refobs'" as text "; id (`idvar') = " %12.0f as res `idvar'[`refobs'] capture assert `refobs' >=1 & `refobs' <= _N if _rc ~=0 { disp as error "refobs must be in the range 1 .. _N" exit 198 } capture confirm numeric var `treated' if _rc ~=0 { disp as error "treated must be a numeric var" exit 198 } capture assert `treated'==0 | `treated'==1 if _rc ~=0 { disp as error "treated must be valued {0, 1}" exit 198 } capture assert `treated'[`refobs'] if _rc ~=0 { disp as error "Warning: refobs `refobs' is not treated." // but this is not fatal } if index("`matchon'", "`treated'") { disp as error "treated(`treated') may not be part of the matchon vars." exit 198 } tempvar is_refobs not_refobs gen byte `is_refobs' = _n == `refobs' quietly count if `is_refobs' assert r(N) ==1 gen byte `not_refobs' = ~`is_refobs' tempvar is_treated_or_ref gen byte `is_treated_or_ref' = `treated' | `is_refobs' // We expect, usually, but do not require, that the ref is treated. tempvar not_treated_or_ref gen byte `not_treated_or_ref' = ~`is_treated_or_ref' quietly count if `is_treated_or_ref' local k1 = r(N) + 1 drop `is_treated_or_ref' if `k1' > _N { disp as error "no non-treated non-ref obs; no matches to be made" exit /* formerly exit 198, but now (12-3-2003) just exit and continue, not an error; but no matches will be made. */ } tempvar matchok not_matchok gen byte `matchok' = 1 foreach var of local matchon { quietly replace `matchok' = `matchok' & (`var' == `var'[`refobs']) } gen byte `not_matchok' = ~`matchok' sort `not_treated_or_ref' `not_refobs' `not_matchok' `scorevar', stable /* That could have been done with gsort -- eliminating the confusing use of the "not_..." variables, but gsort doesn't have -stable-. */ assert `is_refobs'[1] /* -- that means that, the refobs will land in place 1. (Remember that only one obs is `is_refobs'.) This will be useful in the following, where we replace something -in 1-. We could have do replace ... if `is_refobs', but the other is more efficient, especially if we will be doing multiple picks, as we plan to do later. */ drop `is_refobs' local num_pickids "0" if "`pickids'" ~= "" { local d1 "0" local nmiss "0" local notokay "0" foreach var of local pickids { local ++num_pickids capture assert mi(`var'[1]) if _rc ~=0 { disp as error "pickid (`var') must be missing in refobs (`refobs')" exit 198 } local k2 = `k1'+`d1' if `k2' <= _N & `matchok'[`k2'] { if mi(`scorevar'[`k2']) { local ++nmiss local pickidscoremis "`pickidscoremis' `var'" } else { quietly replace `var' = `idvar'[`k2'] in 1 } } else { local ++notokay local pickidnotokay "`pickidnotokay' `var'" } local ++d1 } if `nmiss' >0 { disp as error "Warning: distance measure missing for `nmiss' of the pickids:" disp as error " `pickidscoremis'" } if `notokay' >0 { disp as error "Warning: no matching obs found for `notokay' of the pickids." disp as error " `pickidnotokay'" /* -- that refers to a lack of qualified match candidates (possibly limited by the use of -matchon-). */ } } if "`genfile'" ~= "" { local nmiss "0" local notokay "0" local typ1: type `idvar' if substr("`typ1'",1,3) == "str" { local idvarmisval `""""' } else { local idvarmisval "." } if "`score'" ~= "" { local score_elt "(`scorevar'[1])" // ought to be 0 } post `postfilehandle' (`idvar'[1]) (`idvar'[1]) (0) `score_elt' if "`all'" ~= "" { // take ALL possible matches count if `matchok' & `not_treated_or_ref' local nummatches = max(`nummatches', r(N)) //-- replace it with the actual number of available matches, if larger. } forvalues j1 = 1 / `nummatches' { local k2 = `k1'+`j1' - 1 if ("`full'" == "") & (`k2' > _N) { continue, break } local ok_to_post 0 if "`score'" ~= "" { local score_elt "(`scorevar'[`k2'])" } if `k2' <= _N & `matchok'[`k2'] { if mi(`scorevar'[`k2']) { local ++nmiss } else { local ok_to_post 1 } } else { local ++notokay if "`score'" ~= "" { local score_elt "(.)" // note 1 } } if `ok_to_post' { post `postfilehandle' (`idvar'[1]) (`idvar'[`k2']) (`j1') `score_elt' } else if "`full'" ~= "" { post `postfilehandle' (`idvar'[1]) (`idvarmisval') (`j1') `score_elt' // note 2 } } if `nmiss' >0 { disp as error "Warning: distance measure missing for `nmiss' candidates" } if `notokay' >0 { disp as error "Warning: no matching obs found for `notokay' of the nummatches" /* -- that refers to a lack of qualified match candidates (possibly limited by the use of -matchon-). */ } } /* Note 1: Give a missing score value -- in case we have ~`ok_to_post' and "`full'" ~= "" (and "`score'" ~= ""). Then, if `k2' <= _N & ~`matchok'[`k2'], then the scores may well exist, but they are meaningless, in that they refer to invalidated matches. And the scores may likely be less than those among the valid matches. (I.e., the scores increas as you go down the list of ok matches; but once you go past the `matchok' segment, the scores may restart at a lower value. Of course, if `k2' > _N, then there is no score at all. But if we have ~`ok_to_post' and "`full'" ~= "", where the reason for ~`ok_to_post' is a missing score, then the score_elt will naturally be missing. So no need to intervene in that case. Note 2: We could have done this: if "`score'" ~= "" { local score_elt "(.)" } post `postfilehandle' (`idvar'[1]) (`idvarmisval') (`j1') `score_elt' That would take care of all invalid-but-posted cases (due to -full- and -score-), and it would obviate the need for setting score_elt at the point where note 1 is. But it would replace any extended missing values -- if any (which I don't expect) with sysmis. */ /* Report scores of 0: */ quietly count if `scorevar'==0 & `not_treated_or_ref' & `matchok' if r(N) >1 { disp as text "Note: there are " r(N) " cases of score==0" // That's one possible cause of multiplicities. } /* Report equal scores: */ local num_picks = max(`num_pickids', `nummatches') local kmax = `k1' + `num_picks' /* prior to 3-27-2008, kmax was `k1' + `num_picks' -1. That captured all the scores of the chosen items. BUT it omits one potential ambiguity: if the LAST item has the same score as the next one in line. E.g., if you pick 3 matches as match 3 has same score as match 4 (were it to be included). This higher value of kmax should take care of that situation. */ forvalues j1 = `=`k1'+1' / `kmax' { if `j1' <= _N & ~mi(`scorevar'[`j1']) { if (`scorevar'[`j1'] == `scorevar'[`j1'-1]) & /// ((`j1' < (`k1'+2)) | (`scorevar'[`j1'] ~= `scorevar'[`j1'-2]) ) { disp as text "Note: equal scores found; score: " as res `scorevar'[`j1'] /* And an arbitrary choice was taken, the one that is earler in the order. */ } } else { continue, break } } /* Note: in the the above code, the extra condition, ((`j1' < (`k1'+2)) | (`scorevar'[`j1'] ~= `scorevar'[`j1'-2]) ) prevents multiple reports of the same repeated score. Otherwise, if the same score value occurs more that twice, it would be reported more than once. Also note that this only checks adjacent pairs; that's all you need, since the set is sorted by `scorevar'. */ end // mahapicksome prog def assert_distinct args name1 name2 c1 c2 /* `name1' and `name2' are allegedly the names of macros. `c1' and `c2' are supposed to be the corresponding contents. We need to pass names and contents separately, because they are local to the calling context. */ if "`c1'" == "`c2'" { disp as error "`name1' and `name2' must be distinct" exit 198 } end // assert_distinct