{smcl}
{* *! version 2.37.0 16aug2019}{...}
{vieweralsosee "ftools" "help ftools"}{...}
{vieweralsosee "fmerge" "help fmerge"}{...}
{vieweralsosee "[R] merge" "help merge"}{...}
{viewerjumpto "Syntax" "join##syntax"}{...}
{viewerjumpto "description" "join##description"}{...}
{viewerjumpto "options" "join##options"}{...}
{viewerjumpto "examples" "join##examples"}{...}
{viewerjumpto "about" "join##about"}{...}
{title:Title}

{p2colset 5 13 20 2}{...}
{p2col :{cmd:join} {hline 2}}Join/merge datasets{p_end}
{p2colreset}{...}

{marker syntax}{...}
{title:Syntax}

{pstd}
Many-to-one join/merge on specified key variables

{p 8 13 2}
{cmd:join}
[{varlist}]{cmd:,}
{opth from(filename)}
{opth by(varlist)}
[{it:options}]

{pstd}
As above, but with the "using" dataset currently open instead of the "master"

{p 8 13 2}
{cmd:join}
[{varlist}]{cmd:,}
{opth into(filename)}
{opth by(varlist)}
[{it:options}]

{pstd}
(Note: if {it:varlist} is specified, only a subset of the variables will be added)



{synoptset 24 tabbed}{...}
{synopthdr}
{synoptline}
{p2coldent:+ {cmd:from(}{help filename} [{help if}]{cmd:)}}filename of the {it:using} dataset, where the keys are unique{p_end}
{...}
{p2coldent:+ {cmd:into(}{help filename} [{help if}]{cmd:)}}filename of the {it:master} dataset{p_end}
{...}
{p2coldent:* {opth by(varlist)}}key variables; {it:master_var=using_var} is also allowed in case the variable names differ between datasets{p_end}
{...}
{synopt :{opt uniq:uemaster}}assert that the merge will be 1:1
{p_end}
{...}
{synopt :{cmd:keep(}{help join##results:{it:results}}{cmd:)}}specify which match results to keep
{p_end}
{...}
{synopt :{cmd:assert(}{help join##results:{it:results}}{cmd:)}}specify required match results
{p_end}
{...}
{synopt :{opth gen:erate(newvar)}}name of new variable to mark merge
      results; default is {cmd:_merge}
{p_end}
{...}
{synopt :{opt nogen:erate}}do not create {cmd:_merge} variable
{p_end}
{...}
{synopt :{opt nol:abel}}do not copy value-label definitions from using{p_end}
{...}
{synopt :{opt nonote:s}}do not copy notes from using{p_end}
{...}
{synopt :{opt keepn:one}}don't add any variables from using (overrides default of {it:_all}){p_end}
{...}
{synopt :{opt v:erbose}}show internal debug info
{...}
{p_end}
{synopt :{opt method(string)}}(advanced) set method used internally for hashing
{p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}+ you must include either {opt from()} or {opt from()} but not both.{p_end}
{p 4 6 2}* {opt by(varlist)} is required.{p_end}


{marker description}{...}
{title:Description}

{pstd}
{cmd:join} is an alternative for {help merge},
supporting {it:m:1} and {it:1:1} joins.
{p_end}

{pstd}{bf:Advantages:}

{pmore}- Increasingly faster above 100,000 obs (it is O(N) instead of O(N log N)){p_end}
{pmore}- Does not sort the data by the keys{p_end}
{pmore}- Keys can have different names in master and using{p_end}

{pstd}{bf:Technical notes:}

{pstd}
{cmd:join} works by hashing the keys, instead of sorting the data like {cmd:merge} does
(see 
{browse "https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AnalyzingData/Optimizations/HashJoinsVs.MergeJoins.htm":[1]}
and
{browse "http://support.sas.com/resources/papers/proceedings09/071-2009.pdf":[2]}
for a comparison of the hash+join and sort+merge algorithms).
As a result, {cmd:join} performs better if the datasets are not
already sorted by the {it:by()} variables, and for datasets
above 100,000 observations (due to Mata's overhead).
{p_end}


{marker options}{...}
{title:Options}

{dlgtab:Options}

{phang}
{opth from(filename)}
    specifies the {it:using} filename.
    This dataset is typically smaller than the master dataset
    (the one currently active)
    and the key variables set with {cmd:by()} must uniquely
    identify the dataset (see {cmd:isid} and {cmd:fisid}).

{pmore}
    If {cmd:by()} does not identify the {it:using} dataset
    but a subset of it, you can add use {if} within {cmd:from}

{pmore}
    EG: Suppose master has country-level data for a the year 2016,
    and {xyz} has country-year GDP data, then you can do
    {it:using(xyz if year==2016)}

{phang}
    {opth into(filename)} is useful if you are currently in the
    {it:using} dataset. This is faster than the alternative of
    saving, loading the master dataset, and then calling {it:join}
    (which would then have to open the using dataset again)

{phang}
{opth by(varlist)}
    specifies the key variables 
    that will be used to join both datasets.
    If the names of the variables differ, you can use
    {it:master_var=using_var} (e.g. {cmd:by(year=yr)})

{phang}
{opt uniquemaster}
    ensures that the match is 1:1 instead of m:1.
    Since a 1:m merge can be replicated with the {opt into()} option,
    this means that all merges except m:m (not recommended),
    and update/replace merges (less useful) are supported.

{phang}
{opth generate(newvar)}
    specifies that the variable containing match
    {help merge##results:results} information should be named {it:newvar}
    rather than {cmd:_merge}.

{phang}
{cmd:nogenerate} specifies that {cmd:_merge} not be created.  This
    would be useful if you also specified {cmd:keep(match)}, because
    {cmd:keep(match)} ensures that all values of {cmd:_merge} would be 3.

{phang}
{cmd:nolabel}
    specifies that value-label definitions from the using file be ignored.
    This option should be rare, because definitions from the master are
    already used.

{phang}
{cmd:nonotes}
    specifies that notes in the using dataset not be added to the 
    merged dataset; see {manhelp notes D:notes}.

{phang}
{cmd:keepnone}
    specifies that no variables from using will be added. Use this in combination
    with {cmd:keep(match)} if you just want to keep a group of observations
    from the master dataset.

{phang}
{cmd:noreport}
    specifies that {cmd:join} not present its summary table of
    match results.

{phang}
{cmd:assert(}{it:results}{cmd:)}
    specifies the required match results.  The possible
    {it:results} are 

{marker results}{...}
           numeric    equivalent
            code      word ({it:results})     description
           {hline 67}
              {cmd:1}       {cmdab:mas:ter}             observation appeared in master only
              {cmd:2}       {cmdab:us:ing}              observation appeared in using only
              {cmd:3}       {cmdab:mat:ch}              observation appeared in both

{pmore}
Numeric codes and words are equivalent when used in the {cmd:assert()}
or {cmd:keep()} options.

{pmore}
The following synonyms are allowed:
{cmd:masters} for {cmd:master}, 
{cmd:usings} for {cmd:using},
{cmd:matches} and {cmd:matched} for {cmd:match},
{cmd:match_updates} for {cmd:match_update}, 
and 
{cmd:match_conflicts} for {cmd:match_conflict}. 

{pmore}
    Using {cmd:assert(match master)} specifies that the merged file is
    required to include only matched master or using 
    observations and unmatched master observations, and may not 
    include unmatched using observations.  Specifying {cmd:assert()}
    results in {cmd:join} issuing an error if there are match results
    among those observations you allowed.

{pmore}
The order of the words or codes is not important, so all the following
{cmd:assert()} specifications would be the same:

{pmore2}
{cmd:assert(match master)}

{pmore2}
{cmd:assert(master matches)}

{pmore2}
{cmd:assert(1 3)}

{pmore}
    When the match results contain codes other than those allowed,
    return code 9 is returned, and the 
    merged dataset with the unanticipated results is left in memory
    to allow you to investigate.

{phang}
{cmd:keep(}{help join##results:{it:results}}{cmd:)}
    specifies which observations are to be kept from the merged dataset.
    Using {cmd:keep(match master)} specifies keeping only
    matched observations and unmatched master observations after merging.

{pmore}
    {cmd:keep()} differs from {cmd:assert()} because it selects
    observations from the merged dataset rather than enforcing requirements.
    {cmd:keep()}
    is used to pare the merged dataset to a given set of observations when
    you do not care if there are other observations in the merged dataset.
    {cmd:assert()} is used to verify that only a given set of observations
    is in the merged dataset.

{pmore}
   You can specify both {cmd:assert()} and {cmd:keep()}.  If you require 
   matched observations and unmatched master observations
   but you want only the matched observations, then you could specify
   {cmd:assert(match master)} {cmd:keep(match)}.


{phang}
{cmd:verbose}
    is a programmer command that will report Mata debug information


{marker examples}{...}
{title:Examples}

{pstd}Perform m:1 merge{p_end}

{inp}
    {hline 60}
    webuse nlswork
    replace year = 1900 + year
    join xrate, by(year) from(http://www.stata-press.com/data/r14/pennxrate.dta if country=="JPN")
    {hline 60}
{txt}


{marker about}{...}
{title:Author}

{pstd}Sergio Correia{break}
Board of Governors of the Federal Reserve System, USA{break}
{browse "mailto:sergio.correia@gmail.com":sergio.correia@gmail.com}{break}
{p_end}


{title:More Information}

{pstd}{break}
To report bugs, contribute, ask for help, etc. please see the project URL in Github:{break}
{browse "https://github.com/sergiocorreia/ftools"}{break}
{p_end}