{smcl}
help {hi:strgroup}
{hline}

{title:Title}

{p 4 4 2}{cmd:strgroup} {hline 2} Match strings based on their Levenshtein edit distance.

{title:Syntax}

{p 4 4 2}{cmd:strgroup} {it:varname} [if] [in], {cmdab:gen:erate(}{it:newvarname}{cmd:)} {cmdab:thresh:old(}{it:#}{cmd:)}
[{cmd:first} {cmdab:norm:alize([shorter|longer|none])} {cmd:noclean} {cmd:force}]

{p 2 2 1}{cmd:by} is allowed; see help {help by}.

{title:Description}

{p 4 4 2}{cmd:strgroup} performs a fuzzy string match using the following algorithm:

{p 8 14 2}1. Calculate the Levenshtein edit distance between all pairwise combinations of strings in {it:varname}.

{p 8 14 2}2. Normalize the edit distance as specified by {cmd:normalize([shorter|longer|none])}. The default is to divide the edit distance by the length of the shorter string.

{p 8 14 2}3. Match a string pair if its normalized edit distance is less than or equal to the user-specified threshold.

{p 8 14 2}4. If string A is matched to string B and string B is matched to string C, then match A to C.

{p 8 14 2}5. Assign each group of matches a unique number and store this result in {it:newvarname}.

{p 4 4 2}For example, the Levenshtein edit distance between "widgets" and "widgetts" is 1. The lengths of these two strings are 7 and 8, respectively.
Assuming {cmd:normalize(shorter)}, they are matched by {cmd:strgroup} if 1/7 (approximately 0.143) <= threshold.

{p 4 4 2}See {help levenshtein:levenshtein} for an explanation of the Levenshtein edit distance.

{title:Options}

{p 4 8 2}
{cmd:generate(}{it:newvarname}{cmd:)} specifies the name of a new variable to store the results.

{p 4 8 2}
{cmd:threshold(}{it:#}{cmd:)} sets the threshold level for matching: two strings are matched if their normalized edit distance is less than or equal to {it:#}.

{p 4 8 2}
{cmd:first} instructs {cmd:strgroup} to match only strings that share the same first character. This typically reduces the amount of time required for {cmd:strgroup} to run
by several orders of magnitude, at the cost of possibly missing some valid matches.
For example, "widgets" and "qidgets" will not be matched if you specify {cmd:first} because they do not begin with the same character.

{p 4 8 2}
{cmd:normalize([shorter|longer|none])} defines the normalization of the Levenshtein edit distance. {cmd:shorter} divides each edit distance by the length of the shorter string in the pair; this is the default.
{cmd:longer} divides the edit distance by the length of the longer string. {cmd:none} specifies that the edit distance is not normalized.

{p 4 8 2}
{cmd:noclean} instructs {cmd:strgroup} not to trim leading and trailing blanks when comparing string pairs. By default, blanks are trimmed, which can reduce run time.

{p 4 8 2}
{cmd:force} forces {cmd:strgroup} to run even when comparing more than 10,000 observations. This may take a while and may cause memory problems if your dataset is too large.

{title:Remarks}

{p 4 4 2}
{cmd:strgroup} does not match missing strings. {cmd:strgroup} is case sensitive.

{p 4 4 2}
The Levenshtein edit distance is calculated using byte-based comparisons, and non-ASCII characters occupy more than one byte in Unicode (UTF-8) strings.
For example, the edit distance between the Unicode characters '\$' and '£' is 2, not 1:

{col 8}{cmd:. {stata levenshtein "\$" "£"}}

{p 4 4 2}
To avoid this issue, use Stata's string functions, such as {cmd:ustrto()}, to convert multi-byte characters to single-byte characters:

{col 8}{cmd:. {stata levenshtein "\$" "`=ustrto("£","latin1",1)'"}}

{title:Notes}

{p 4 4 2}
As explained above, {cmd:strgroup} calculates the Levenshtein edit distance between all pairwise combinations of strings in {it:varname}.
Let N be the number of observations being compared. Then the amount of memory and the number of calculations required by {cmd:strgroup} are proportional to N(N-1)/2,
an expression that increases with the square of N. Thus, large datasets need to be divided into subsets in order to facilitate calculations.
The {cmd:first} option automates this by subsetting strings according to their first characters.
Alternatively, the user can run {cmd:strgroup} on subsets of the data by using the {cmd:if} and {cmd:in} qualifiers and/or the {cmd:by} prefix.

{p 4 4 2}
{cmd:strgroup} is implemented as a C {help plugin:plugin} in order to minimize memory requirements and to maximize speed.
Plugins are specific to the hardware architecture and software framework of your computer, i.e., they are not cross-platform.
A platform is defined by two characteristics: machine type and operating system. Stata stores these characteristics in {cmd:c(machine_type)} and {cmd:c(os)}, respectively.
{cmd:strgroup} supports the following platforms at this time:

{col 10}{hi:Machine type}{col 40} {hi:Operating system}
{col 10}PC{col 40} Windows
{col 10}PC (64-bit x86-64){col 40} Windows
{col 10}PC (64-bit x86-64){col 40} Unix
{col 10}Macintosh{col 40} MacOSX
{col 10}Macintosh (Intel 64-bit){col 40} MacOSX

{title:Example}

{p 4 4 2}
Merge two datasets together and identify potential matches that didn't merge.

{col 8}{cmd:. {stata sysuse auto, clear}}
{col 8}{cmd:. {stata tempfile t}}
{col 8}{cmd:. {stata keep make price}}
{col 8}{cmd:. {stata replace make = make + "a" in 5}}
{col 8}{cmd:. {stata replace make = "gibberish" in 10}}
{col 8}{cmd:. {stata save "`t'"}}
{col 8}{cmd:. {stata sysuse auto, clear}}
{col 8}{cmd:. {stata keep make}}
{col 8}{cmd:. {stata merge make using "`t'", sort}}
{col 8}{cmd:. {stata list if _merge!=3}}
{col 8}{cmd:. {stata strgroup make if _merge!=3, gen(group) threshold(0.25)}}
{col 8}{cmd:. {stata list if _merge!=3}}

{title:Acknowledgements}

{p 4 4 2}
The code used to calculate the Levenshtein edit distance is based on a Python extension written by David Necas (Yeti).
His code is publicly available at {browse "http://code.google.com/p/pylevenshtein/source/browse/trunk/Levenshtein.c":http://code.google.com/p/pylevenshtein/source/browse/trunk/Levenshtein.c}.

{p 4 4 2}
Thanks to Dimitriy Masterov for compiling the 64-bit Windows version of the plugin.

{title:Citation of strgroup}

{p 4 4 2}
{cmd:strgroup} is not an official Stata command. It is a free contribution to the research community. You may cite it as:

{col 8} Reif, J., 2010. strgroup: Stata module to match strings based on their Levenshtein edit distance. {browse "http://ideas.repec.org/c/boc/bocode/s457151.html":http://ideas.repec.org/c/boc/bocode/s457151.html}.

{title:Author}

{p 4 4 2}Julian Reif, University of Illinois

{p 4 4 2}jreif@illinois.edu

{title:Also see}

{p 4 4 2}
{help levenshtein:levenshtein}, {help regexm:regexm}