-------------------------------------------------------------------------------
help for reclink
-------------------------------------------------------------------------------

Record Linkage

reclink varlist using filename , idmaster(varname) idusing(varname) gen(newvarname) [ wmatch(match weight list) wnomatch(non-match weight list) orblock(varlist) required(varlist) exactstr(varlist) exclude(filename) _merge(newvarname) uvarlist(varlist) uprefix(text) minscore(#) minbigram(#)

Description

reclink uses record linkage methods to match observations between two datasets where no perfect key fields exist -- essentially a fuzzy merge. reclink allows for user-defined matching and non-matching weights for each variable and employs a bigram string comparator to assess imperfect string matches.

The master and using datasets must each have a variable that uniquely identifies observations. Two new variables are created, one to hold the matching score (scaled 0-1) and one for the merge variable. In addition, all of the matching variables from the using dataset are brought into the master dataset (with newly prefixed names) to allow for manual review of matches.

To enhance the speed of this often slow procedure, or-blocking can be used which requires at least one variable to match perfectly between datasets. Or-blocking is the default if 4 or more variables are specified.

Note and Warning

In general, record linkage methods are imperfect and results should be manually reviewed, especially for observations with lower matching scores. It is not uncommon to try several runs with a variety of weights, orblock options, and derived variables to increase the accuracy of the linkage. A series of reclink commands can be used with the help of the exclude option.

Required Options

idmaster(varname) is required and specifies the name of a variable in the master dataset that uniquely identifies the observations. This variable is used to track observations. If a unique identifer does not exist, one can be created simply as gen idmaster=_n.

idusing(varname) is required and specifies the name of a variable in the using dataset that uniquely identifies the observations analogous to idmaster.

gen(newvarname) is required and specifies the name of a new variable created by reclink to store the matching scores (scaled 0-1) for the linked observations.

Common Options

wmatch(numlist) specifies the weights given to matches for each variable in varlist. Each variable requires a weight, although a default of 1 will be used for all variables if not specified. Weights must be >=1 and are typically integers from 1 to 20. The values should reflect the relative likelihood of a variable match indicating a true observation match. For example, a name variable will often have a large weight such as 10 but a city variable, where many duplicates are expected, may have a weight of just 2.

wnomatch(numlist) specifies the weights given to mismatches for each variable in the varlist. These weights are analogous to wmatch weights, but instead reflect the relative likelihood that a mismatch on a variable indicates that the observations don't match -- a small value indicates that mismatches are expected even if the observations truly match. A variable such as telephone number may have a large wmatch but a small wnomatch because matches are unlikely to occur randomly, but mismatches may be fairly common due to changes in phone numbers over time or multiple phone numbers owned by the same person/entity.

orblock(varlist | none) is used to speed up the record linkage by providing a method for selecting only subsets of observations from the using dataset to search for matches. Only observations that match on at least one variable in the Or-Block are examined. Or-blocking on the full varlist is the default behavior if there are 4 or more variables specified. This default can be overriden by specifying orblock(none), which is advised if all of the variables are expected to be fairly unique. New variables are sometimes created in the master and using datasets to assist with Or-Blocking, such as initials of first and last names, street numbers extracted from addresses, and telephone area codes. Or-Blocking can dramatically improve the speed of reclink.

required(varlist) allows the user to specify one or more variables that must match exactly for the observation to be considered a match. The variable(s) must also be in the main varlist and are included in the matching score. This option could have been named andblock to make it's function clear in relation to orblock.

exclude(filename) allows the user to specify the name of a file that contains previously matched observations, providing a convenient way to use reclink repeatedly with different specifications. The exclude file must include the variables specified in idmaster and idusing. Any observation with non-missing values for both id variables is considered matched and is excluded from the datasets for the current matching. Results from each run of reclink can be appended together and specified as the exclude file. This approach can speed up the matching process by starting with more restrictive orblock and/or required specifications that work quickly, followed by a more exhaustive and slow search for the more difficult observations.

Less Commonly Used Options

_merge(varname) specifies the name of the variable that will mark the source of each observation. The default name is _merge(_merge).

exactstr(varlist) allows the user to specify one or more string variables where the bigram string comparator is not used to assess the degree of agreement, but instead the agreement is simply 0 or 1.

uvarlist(varlist) allows the using dataset to have different variable names than the master dataset for the variables to be matched. If specified, the uvarlist must have the same number of variables in the same ordering as the master varlist.

uprefix(string) allows changing the prefix used for renaming the variables in the matching varlist that are brought into the master dataset from the using dataset. The default uprefix is U. For example, if the matching variables are name and address, then the resulting dataset will have variables Uname and Uaddress added from the using dataset for the matching observations.

minscore(#) specifies the minimum overall matching score value (0-1) used to declare two observations a match, default=0.6. Observations in the using dataset are only merged into the master dataset if they have a match score>=minscore and are the highest match score in the using dataset. Lower values of minscore will expand the number of matches but may lead to more false matches.

minbigram(#) specifies the bigram value needed to declare two strings as likely matched, default=0.6. Each raw bigram score is transformed into match and non-match weight multipliers that vary from 0 to 1 with a sharp change at minbigram. A higher value of minbigram may be useful when matching longer strings.

Example

. reclink fname lname address zip phone using bigset, gen(myscore) idm(id) idu(recno) wmatch(3 8 10 2 8) wmnomatch(4 5 8 4 2)

finds matches between current dataset and bigset based on 5 variables. Uses orblocking by default so that only records that match on fname or lname or address or zip or phone will be examined. Could specify orblock(none) to widen possible matches but much slower. Could gen initials=substr(fname,1,1)+substr(lname,1,1) in both datasets and then add initials to the varlist to increase likelihood that or-blocking will work.

Author

Michael Blasnik michael.blasnik@verizon.net

Also see

Online: help for merge, joinby