{smcl} {* 15apr2017}{...} {cmd:help rangejoin} {hline} {title:Title} {phang} {cmd:rangejoin} {hline 2} Form pairwise combinations if a key variable is within range {title:Syntax} {p 8 17 2} {cmd:rangejoin} {it:keyvar low high} {cmd:using} {it:{help filename:using_dataset}} [{cmd:,} {it:options}] {synoptset 20 tabbed}{...} {synopthdr} {synoptline} {synopt:{opth by(varlist)}}pairwise combinations occur within groups{p_end} {synopt:{opth k:eepusing(varlist)}}variables to keep from the {it:{help filename:using_dataset}}{p_end} {synopt:{opth p:refix(strings:string)} }a stub used as a prefix when renaming variables from the {it:{help filename:using_dataset}}{p_end} {synopt:{opth s:uffix(strings:string)}}a stub used as a suffix when renaming variables from the {it:{help filename:using_dataset}}{p_end} {synopt:{opt a:ll}}all variables from the {it:{help filename:using_dataset}} will be renamed{p_end} {synoptline} {p2colreset}{...} {title:Description} {pstd} This version of {cmd:rangejoin} requires version 1.1.0 of {stata ssc des rangestat:rangestat}. Click {stata ssc install rangestat:here to install} {cmd:rangestat} from SSC. {pstd} {cmd:rangejoin} forms pairwise combinations of observations in memory and observations from the {it:{help filename:using_dataset}} when the value of {it:keyvar} in the {it:{help filename:using_dataset}} is within the range specified by {it:low} and {it:high} in the data in memory. {pstd} {it:keyvar} is a numeric variable in the {it:{help filename:using_dataset}}. {pstd} {it:low} or {it:high} can be specified using a numeric variable in the data in memory. Alternatively, {it:low} or {it:high} can be specified using a {it:#} (a number in Stata parlance). If a {it:#} is used and there is a numeric variable with the same name as {it:keyvar} in the data in memory, the bound for each observation is computed by adding {it:#} to {it:keyvar}. If a {it:#} is used and {it:keyvar} does not exist in the data in memory, the {it:#} is used as the bound. Finally, you can specify {it:low} or {it:high} using a {help missing:system missing value}, in which case the bound for each observation will be missing. {pstd} {cmd:rangejoin} applies the same rules as {help inrange()} for missing bounds: if the lower bound is missing, observations will match up to and including the value of {it:high}. If both {it:low} and {it:high} are missing, all observations will match. Note that the treatment of missing values for {it:low} and {it:high} differs in this version of {cmd:rangejoin} and this may require that previous code be adapted. Without bounds, {cmd:rangejoin} forms the same pairwise combinations that {help cross} would (or {help joinby} if the {opth by(varlist)} option is used to restrict the matching by group). {pstd} Observations with missing values for {it:keyvar} in the {it:{help filename:using_dataset}} will never be considered in range and as such will never match. To prevent unintentional matches, if there is a variable of the same name as {it:keyvar} in the data in memory, no match will occur if its value is missing for the current observation (since {it:low} or {it:high} will most likely be computed relative to the value of {it:keyvar}, which would lead to missing bounds). {pstd} {cmd:rangejoin} will not try to find matches for observations where {it:low} > {it:high}. {pstd} The results will include all variables from the {it:{help filename:using_dataset}} unless the {opth k:eepusing(varlist)} is specified. Any variable from the {it:{help filename:using_dataset}} will be renamed if it also exists in the data in memory. Use the {opt a:ll} option if all variables from the {it:{help filename:using_dataset}} are to be renamed. When renaming variables, the {opth p:refix(strings:string)} and {opth s:uffix(string)} stubs are both used. If neither is specified, the default is to use {cmd:suffix(_U)}. Variables specified in {opth by(varlist)} must exist both in the data in memory and in the {it:{help filename:using_dataset}}. {title:Example: Finding houses within the client's budget} {pstd} In the following example, you have a list of houses on the market and the current asking price. You also have clients looking for a house and their price range. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - house_cross}{...} clear input house asking 1 111 2 222 3 333 4 444 5 555 6 666 end save "house_asking.dta" clear input str5 name low high Peter 300 500 Paul 400 600 Mary 600 700 end rangejoin asking low high using "house_asking.dta" sort name house list, sepby(name) {* erase "house_asking.dta"}{...} {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangejoin_run house_cross using rangejoin.sthlp:click to run})} {pstd} Do this again, this time matching clients to houses in the same zip code. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - house_joinby}{...} clear input house asking zip 1 111 48101 2 222 48101 3 333 48101 4 444 48101 5 555 48103 6 666 48103 end save "house_asking.dta" clear input str5 name low high zip Peter 300 500 48103 Paul 400 600 48103 Mary 600 700 48101 end rangejoin asking low high using "house_asking.dta", by(zip) sort name house list, sepby(name) {* erase "house_asking.dta"}{...} {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangejoin_run house_joinby using rangejoin.sthlp:click to run})} {pstd} If no house falls within the client's price range, the observation remains but with missing values for variables from the using dataset. {title:Match domestic cars to similarly priced foreign cars with the same repair record} {pstd} Using the iconic Stata auto dataset, we match each domestic car to similarly priced foreign cars with the same repair record. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - similar_car}{...} sysuse auto, clear keep if foreign save "foreign_cars.dta" sort rep78 price list make price rep78, sepby(rep78) sysuse auto, clear drop if foreign rangejoin price -1000 1000 using "foreign_cars.dta", by(rep78) gen pdiff = price - price_U sort rep78 make price_U list make price rep78 make_U price_U pdiff, sepby(rep78 make) {* erase "foreign_cars.dta"}{...} {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangejoin_run similar_car using rangejoin.sthlp:click to run})} {pstd} Note that if you do this to calculate some statistic using the joined data and then reduce (collapse) to the original observations, you are probably better off using {stata ssc des rangestat:rangestat} directly. {title:Certification and efficiency considerations} {pstd} {cmd:rangejoin} leverages the power of {cmd:rangestat} to quickly determine, for each observation in memory, the set of observations in the {it:{help filename:using_dataset}} that are in range. Once this is known, each observation in the data in memory is expanded by the number of observations it matched in the {it:{help filename:using_dataset}}. From there, the problem reduces to a m:1 {help merge}. {pstd} You can replicate {cmd:rangejoin} results using {help cross} or {help joinby} but this requires forming all pairwise combinations (within groups if you are using {help joinby}) and then dropping observations outside the desired interval bounds. {pstd} To demonstrate both approaches, the following creates a dataset with a 10% ratio of cases to controls in 5 categories. It also saves the controls in a separate dataset. For added flexibility, the code lets you choose the number of observations. If this is the first time, start with 10,000 observations. Note that you can't go much higher if you plan to run the example using {cmd:joinby} as the total number of pairwise combinations grows exponentially. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - efficiency_data}{...} clear set seed 41234 dis "How many observations? " _request(nobs) set obs $nobs gen id = _n gen case = runiform() < .1 gen category = int(runiform() * 5) + 1 gen date = mdy(1,1,2016) + runiform() * 365 format %td date sum, format tab case save "project_data.dta", replace keep if case==0 drop case save "control_data.dta", replace {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangejoin_run efficiency_data using rangejoin.sthlp:click to run})} {pstd} Now match each case with controls in the same category if the date of the control is within +/- 1 day of the case date. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - efficiency_rs}{...} use if case using "project_data.dta", clear rangejoin date -1 1 using "control_data.dta", by(category) suffix(_ctr) drop if mi(id_ctr) // drop cases that did not find a match sort id id_ctr save "rangejoin_results.dta", replace {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangejoin_run efficiency_rs using rangejoin.sthlp:click to run})} {pstd} Repeat the same task using {cmd:joinby}. There's more code to write because you must rename variables to avoid name conflicts when joining the data. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - efficiency_joinby}{...} use "control_data.dta", clear rename id id_ctr rename date date_ctr tempfile controls save "`controls'" use if case using "project_data.dta" joinby category using "`controls'" keep if inrange(date_ctr, date-1, date+1) * confirm that we match exactly the same controls as rangejoin sort id id_ctr cf _all using "rangejoin_results.dta", all {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangejoin_run efficiency_joinby using rangejoin.sthlp:click to run})} {title:References} {pstd} Cox, N.J. 2007. {browse "http://www.stata-journal.com/sjpdf.html?articlenum=pr0033":Events in intervals.} {it:Stata Journal} 7: 440{c -}443. {pstd} Cox, N.J. 2011. {browse "http://www.stata-journal.com/sjpdf.html?articlenum=dm0055":Compared with ....} {it:Stata Journal} 11: 305{c -}314. {pstd} Cox, N.J. 2014. {browse "http://www.stata-journal.com/article.html?article=dm0075":Self and others.} {it:Stata Journal} 14: 432{c -}444. {title:Acknowledgements} {pstd} Thanks to Clyde Schechter for kindly showing us an example where {cmd:rangejoin} would generate an overflow when computing interval bounds if {it:keyvar + #} could not be stored in a variable of {it:keyvar}'s data type. This was most likely to bite when {it:keyvar} was a byte. Observations with the overflow would be excluded from the sample. This report led to a review of {cmd:rangejoin}'s handling of missing interval bounds and it was decided to follow the same rules as {help inrange()} and allow missing bounds. {title:Authors} {pstd}Robert Picard{p_end} {pstd}picard@netbox.com{p_end} {title:Also see} {psee} SSC: {stata "ssc desc rangestat":rangestat}, {stata "ssc desc tsegen":tsegen} {p_end} {psee} Others: {stata "search vlookup, all":vlookup} {p_end}