levenshtein -- Calculate the Levenshtein edit distance between two strings.
Syntax
Calculate the Levenshtein edit distance between two strings
levenshtein string1 string2
Calculate the Levenshtein edit distances between two string vectors
levenshtein varname1 varname2 [if] [in], generate(newvarname)
Description
levenshtein calculates the Levenshtein edit distance(s) between two strings or two vectors of strings. The Levenshtein edit distance is defined as the minimum number of insertions, deletions, or substitutions necessary to change one string into the other. For example, the Levenshtein edit distance between "mitten" and "fitting" is 3, because the following three edits change one into the other and no sequence of fewer than three edits does so:
1. mitten -> fitten (substitution of 'f' for 'm')
2. fitten -> fittin (substitution of 'i' for 'e')
3. fittin -> fitting (insert 'g' at the end)
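The definition above can be illustrated with the standard dynamic-programming recurrence. The following is a minimal Mata sketch for illustration only, not the plugin's actual code; the function name lev_sketch is hypothetical. Because strlen() and substr() operate on bytes, the sketch assumes plain ASCII input.

```stata
mata:
// Illustrative only: textbook dynamic-programming edit distance.
// lev_sketch is a hypothetical name, not part of the levenshtein command.
real scalar lev_sketch(string scalar s, string scalar t)
{
    real scalar m, n, i, j, cost
    real matrix d

    m = strlen(s)
    n = strlen(t)

    // d[i+1, j+1] holds the distance between the first i characters
    // of s and the first j characters of t
    d = J(m + 1, n + 1, 0)
    for (i = 1; i <= m; i++) d[i + 1, 1] = i    // i deletions
    for (j = 1; j <= n; j++) d[1, j + 1] = j    // j insertions

    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            // a substitution costs 0 when the characters already match
            cost = (substr(s, i, 1) != substr(t, j, 1))
            d[i + 1, j + 1] = min((d[i, j + 1] + 1,
                                   d[i + 1, j] + 1,
                                   d[i, j] + cost))
        }
    }
    return(d[m + 1, n + 1])
}
end

mata: lev_sketch("mitten", "fitting")
```

For "mitten" and "fitting" the sketch returns 3, in agreement with the three edits listed above.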
Examples
1. Calculate the Levenshtein edit distance between "mitten" and "fitting":
. levenshtein mitten fitting
2. Calculate the Levenshtein edit distance between two string vectors:
. sysuse auto, clear
. decode foreign, gen(foreign_string)
. levenshtein make foreign_string, gen(edit_dist)
Notes
levenshtein is implemented as a plugin to minimize memory requirements and maximize speed. Unfortunately, plugins are specific to the hardware architecture and software framework of your computer, i.e., plugins are not cross-platform. Here a platform is defined by two characteristics, machine type and operating system, which Stata stores in c(machine_type) and c(os), respectively. levenshtein currently supports the following platforms:
Machine type                 Operating system
PC                           Windows
PC (64-bit x86-64)           Unix
Macintosh                    MacOSX
Macintosh (Intel 64-bit)     MacOSX
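To see which platform your copy of Stata reports, you can display the two c() values named above and compare them with the supported combinations. This is a small sketch using only built-in Stata features:

```stata
* Show the two characteristics that define the platform
display "Machine type:     " c(machine_type)
display "Operating system: " c(os)
```

If the pair shown is not among the supported platforms, the plugin will presumably not load on that machine.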
Saved results
r(distance) The Levenshtein edit distance
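As a sketch of how the saved result might be used, the commands below run the two-string form and then read r(distance). This assumes r(distance) is set after the two-string syntax; this help file does not state which syntaxes save it. The local macro name d is arbitrary.

```stata
* Two-string form: compute the distance, then read the saved result
levenshtein mitten fitting
display r(distance)

* Copy the value into a local macro before running another
* r-class command, which would overwrite r()
local d = r(distance)
display "Edit distance: `d'"
```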
Acknowledgements
The algorithm used to calculate the Levenshtein edit distance is based on a Python extension written by David Necas (Yeti). His code is publicly available at http://code.google.com/p/pylevenshtein/source/browse/trunk/Levenshtein.c.
Thanks to James Beard for helpful suggestions and feedback.
Author
Julian Reif, University of Chicago
jreif@uchicago.edu
Also see
strgroup