help levenshtein
-------------------------------------------------------------------------------
Title

levenshtein -- Calculate the Levenshtein edit distance between two strings.

Syntax

Calculate the Levenshtein edit distance between two strings

levenshtein string1 string2

Calculate the Levenshtein edit distances between two string vectors

levenshtein varname1 varname2 [if] [in], generate(newvarname)

Description

levenshtein calculates the Levenshtein edit distance(s) between two strings or two vectors of strings. The Levenshtein edit distance is defined as the minimum number of single-character insertions, deletions, or substitutions needed to change one string into the other. For example, the Levenshtein edit distance between "mitten" and "fitting" is 3: the following three edits change one into the other, and no sequence of fewer than three edits does so:

1. mitten -> fitten (substitution of 'f' for 'm')

2. fitten -> fittin (substitution of 'i' for 'e')

3. fittin -> fitting (insertion of 'g' at the end)
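
The definition above can be made concrete with a short Mata sketch of the standard dynamic-programming recurrence for edit distance. It is an illustration only and not the plugin's code: lev_sketch() is a hypothetical name that is not part of this package, and the sketch compares strings byte by byte, so it assumes plain ASCII input.

    mata:
    // lev_sketch() is a hypothetical helper, not part of the levenshtein package
    real scalar lev_sketch(string scalar s, string scalar t)
    {
        real scalar m, n, i, j, cost
        real matrix d

        m = strlen(s)
        n = strlen(t)

        // d[i+1, j+1] holds the distance between the first i bytes of s
        // and the first j bytes of t
        d = J(m + 1, n + 1, 0)
        for (i = 1; i <= m; i++) d[i + 1, 1] = i    // i deletions
        for (j = 1; j <= n; j++) d[1, j + 1] = j    // j insertions

        for (i = 1; i <= m; i++) {
            for (j = 1; j <= n; j++) {
                // a substitution costs 1 only if the two bytes differ
                cost = (substr(s, i, 1) != substr(t, j, 1))
                d[i + 1, j + 1] = min((d[i, j + 1] + 1, d[i + 1, j] + 1, d[i, j] + cost))
            }
        }
        return(d[m + 1, n + 1])
    }

    lev_sketch("mitten", "fitting")
    end

The last line should display 3, in agreement with the three edits listed above.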

Examples

1. Calculate the Levenshtein edit distance between "mitten" and "fitting":

. levenshtein mitten fitting

2. Calculate the Levenshtein edit distance between two string vectors:

. sysuse auto, clear

. decode foreign, gen(foreign_string)

. levenshtein make foreign_string, gen(edit_dist)
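
3. Inspect the distances generated in example 2. This is a sketch that uses only standard Stata commands; it assumes example 2 has just been run, so that edit_dist exists:

. list make foreign_string edit_dist in 1/5

. summarize edit_dist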

Notes

levenshtein is implemented as a plugin in order to minimize memory requirements and to maximize speed. Unfortunately, plugins are specific to the hardware architecture and software framework of your computer, i.e., plugins are not cross-platform. A platform is defined by two characteristics: machine type and operating system. Stata stores these characteristics in c(machine_type) and c(os), respectively. levenshtein currently supports the following platforms:

Machine type                Operating system
PC                          Windows
PC (64-bit x86-64)          Unix
Macintosh                   MacOSX
Macintosh (Intel 64-bit)    MacOSX
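
To see which platform Stata reports for your computer, display the two c-class values named above and compare them with the supported platforms:

. display c(machine_type)

. display c(os)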

Saved results

r(distance)    The Levenshtein edit distance
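
As a sketch of how the saved result might be retrieved, assuming the two-string form stores r(distance) as listed above, run example 1 and then display or list the result with Stata's standard commands:

. levenshtein mitten fitting

. display r(distance)

. return list

According to the Description, the value for this pair of strings should be 3.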

Acknowledgements

The algorithm used to calculate the Levenshtein edit distance is based on a Python extension written by David Necas (Yeti). His code is publicly available at http://code.google.com/p/pylevenshtein/source/browse/trunk/Levenshtein.c.

Thanks to James Beard for helpful suggestions and feedback.

Author

Julian Reif, University of Chicago

jreif@uchicago.edu

Also see

strgroup