strgroup -- Match strings based on their Levenshtein edit distance.
Syntax
strgroup varname [if] [in] , generate(newvarname) threshold(#) [first normalize([shorter|longer|none]) noclean force]
by is allowed; see help by.
Description
strgroup matches similar strings together. This can be useful when merging data that contain typos. For example, "widgets" will not merge with "widgetts" because the strings are not identical. strgroup provides a way to match strings in an objective and automated manner. It employs the following algorithm:
1. Calculate the Levenshtein edit distance between all pairwise combinations of strings in varname.
2. Normalize the edit distance as specified by normalize([shorter|longer|none]). The default is to divide the edit distance by the length of the shorter string.
3. Match a string pair if their normalized edit distance is less than or equal to the user-specified threshold.
4. If string A is matched to string B and string B is matched to string C, then match A to C.
5. Assign each group of matches a unique number and store this result in newvarname.
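Steps 4 and 5 amount to finding the connected components of a graph whose edges are the matched pairs. The following Python sketch illustrates the idea with a union-find structure. It is not strgroup's actual code; the function name group_matches, the use of observation indices, and the numbering of groups from 1 in order of first appearance are all assumptions made for illustration.

```python
def group_matches(n, matched_pairs):
    """Return a group number for each of n strings, given matched index pairs.

    Hypothetical helper: strings are identified by position 0..n-1, and
    groups are numbered 1, 2, ... in order of first appearance (an assumption).
    """
    parent = list(range(n))  # each string starts in its own group

    def find(i):
        # Follow parent links to the group's representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 4: if A matches B and B matches C, all three share one representative.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Step 5: give each representative a unique group number.
    group_ids = {}
    groups = []
    for i in range(n):
        root = find(i)
        if root not in group_ids:
            group_ids[root] = len(group_ids) + 1
        groups.append(group_ids[root])
    return groups


# Strings 0-1 and 1-2 are matched directly; string 3 is matched to nothing.
print(group_matches(4, [(0, 1), (1, 2)]))  # [1, 1, 1, 2]
```

Strings 0 and 2 land in the same group even though they were never matched directly, which is the transitivity described in step 4.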
For example, the Levenshtein edit distance between "widgets" and "widgetts" is 1. The lengths of these two strings are 7 and 8, respectively. Assuming normalize(shorter), they are matched by strgroup if 1/7 <= threshold.
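The arithmetic in this example can be checked with a short Python sketch. It uses the standard dynamic-programming formulation of the Levenshtein distance, not the plugin's own code; the names levenshtein and is_match are hypothetical, and treating empty strings as unmatched is an assumption made to avoid dividing by zero.

```python
def levenshtein(s, t):
    """Levenshtein edit distance between s and t (row-by-row dynamic programming)."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_match(s, t, threshold):
    """Match test under normalize(shorter): distance / shorter length <= threshold."""
    shorter = min(len(s), len(t))
    if shorter == 0:
        return False  # assumption: empty strings are never matched
    return levenshtein(s, t) / shorter <= threshold


print(levenshtein("widgets", "widgetts"))     # 1
print(is_match("widgets", "widgetts", 0.15))  # True:  1/7 <= 0.15
print(is_match("widgets", "widgetts", 0.10))  # False: 1/7 >  0.10
```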
strgroup is case sensitive.
See levenshtein for an explanation of the Levenshtein edit distance.
Options
generate(newvarname) specifies the name of a new variable to store the results.
threshold(#) sets the threshold level for matching. Two strings are matched if their normalized edit distance is less than or equal to #.
first instructs strgroup to match only strings that share the same first character. This typically reduces the run time of strgroup by several orders of magnitude, at the cost of possibly missing matches between strings that differ in their first character. For example, "widgets" and "qidgets" will not be matched if you specify first because they do not begin with the same character.
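The effect of first on the set of compared pairs can be sketched in Python as follows. This only illustrates the blocking idea and is not how the plugin is written; candidate_pairs and the sample strings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(strings, first=False):
    """Return the index pairs that would be compared (hypothetical helper)."""
    if not first:
        # Without blocking, every pairwise combination is compared.
        return list(combinations(range(len(strings)), 2))
    # With blocking, only strings sharing a first character are compared.
    blocks = defaultdict(list)
    for index, s in enumerate(strings):
        blocks[s[:1]].append(index)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


names = ["widgets", "widgetts", "qidgets", "quidgets"]
print(len(candidate_pairs(names)))              # 6
print(len(candidate_pairs(names, first=True)))  # 2
```

The pair ("widgets", "qidgets") is absent from the blocked list, so these two strings can never be matched no matter how small their edit distance is.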
normalize([shorter|longer|none]) defines how the Levenshtein edit distance is normalized. shorter, the default, divides each edit distance by the length of the shorter string in the pair. longer divides the edit distance by the length of the longer string. none applies no normalization, so threshold(#) is compared against the raw edit distance.
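The three normalizations can be compared side by side with a small Python sketch. The function normalize_distance is hypothetical and assumes both strings are non-empty; it mirrors the option names but is not taken from the plugin.

```python
def normalize_distance(distance, len_a, len_b, mode="shorter"):
    """Normalize an edit distance the way each normalize() choice is described."""
    if mode == "shorter":
        return distance / min(len_a, len_b)  # default: divide by the shorter length
    if mode == "longer":
        return distance / max(len_a, len_b)  # divide by the longer length
    if mode == "none":
        return distance                      # raw edit distance
    raise ValueError("mode must be 'shorter', 'longer', or 'none'")


# "widgets" (7 characters) and "widgetts" (8 characters) have edit distance 1.
for mode in ("shorter", "longer", "none"):
    print(mode, normalize_distance(1, 7, 8, mode))
```

Because shorter divides by the smaller number, it yields the largest normalized distance of the two ratios and is therefore the stricter choice for a given threshold.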
noclean instructs strgroup not to trim leading and trailing blanks when comparing string pairs. Trimming can reduce run time.
force forces strgroup to run even when comparing more than 10,000 observations. This may take a while and may cause memory problems if your dataset is too large.
Notes
strgroup does not match missing strings.
As explained above, strgroup calculates the Levenshtein edit distance between all pairwise combinations of strings in varname. Let N be the number of observations being compared. Then the amount of memory and the number of calculations required by strgroup are proportional to N(N-1)/2, an expression that increases with the square of N. Thus, large datasets need to be divided into subsets in order to facilitate calculations. The first option automates this by subsetting strings according to their first characters. Alternatively, the user can run strgroup on subsets of the data by using the if and in qualifiers or the by prefix.
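A few lines of Python make the growth of N(N-1)/2 concrete and show why subsetting helps. The helper names and the choice of ten equal subsets are illustrative assumptions, not part of strgroup.

```python
def pair_count(n):
    """Number of pairwise comparisons among n observations: n(n-1)/2."""
    return n * (n - 1) // 2


def subset_pair_count(sizes):
    """Total comparisons when each subset is processed separately."""
    return sum(pair_count(size) for size in sizes)


for n in (1_000, 10_000, 100_000):
    print(n, pair_count(n))
# 1000 499500
# 10000 49995000
# 100000 4999950000

# Splitting 10,000 observations into ten subsets of 1,000 each:
print(subset_pair_count([1_000] * 10))  # 4995000
```

In this illustration, dividing the data into ten equal subsets cuts the number of comparisons roughly tenfold, from 49,995,000 to 4,995,000.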
strgroup is implemented as a plugin in order to minimize memory requirements and to maximize speed. Unfortunately, plugins are specific to the hardware architecture and software framework of your computer, i.e., plugins are not cross-platform. Define a platform by two characteristics: machine type and operating system. Stata stores these characteristics in c(machine_type) and c(os), respectively. strgroup supports the following platforms at this time:
    Machine type                Operating system
    PC                          Windows
    PC (64-bit x86-64)          Windows
    PC (64-bit x86-64)          Unix
    Macintosh                   MacOSX
    Macintosh (Intel 64-bit)    MacOSX
Example
Merge two datasets together and identify potential matches that didn't merge.
    . sysuse auto, clear
    . tempfile t
    . keep make price
    . replace make = make + "a" in 5
    . replace make = "gibberish" in 10
    . save "`t'"
    . sysuse auto, clear
    . keep make
    . merge make using "`t'", sort
    . list if _merge!=3
    . strgroup make if _merge!=3, gen(group) threshold(0.25)
    . list if _merge!=3
Acknowledgements
The code used to calculate the Levenshtein edit distance is based on a Python extension written by David Necas (Yeti). His code is publicly available at http://code.google.com/p/pylevenshtein/source/browse/trunk/Levenshtein.c.
Thanks to Dimitriy Masterov for compiling the 64-bit Windows version of the plugin.
Author
Julian Reif, University of Chicago
jreif@uchicago.edu
Also see
levenshtein, regexm