{smcl}
{* *! version November 14, 2016 @ 18:17:52}{...}
{* link to other help files which could be of use}{...}
{vieweralsosee "sqclusterdat" "help sqclusterdat "}{...}
{vieweralsosee "sqdes" "help sqdes "}{...}
{vieweralsosee "sqegen" "help sqegen "}{...}
{vieweralsosee "sqindexplot" "help sqindexplot "}{...}
{vieweralsosee "sqmdsadd" "help sqmdsadd "}{...}
{vieweralsosee "sqmodalplot" "help sqmodalplot "}{...}
{vieweralsosee "sqom" "help sqom "}{...}
{vieweralsosee "sqpercentageplot" "help sqpercentageplot "}{...}
{vieweralsosee "sqset" "help sqset "}{...}
{vieweralsosee "sqstat" "help sqstat "}{...}
{vieweralsosee "sqstrlev" "help sqstrlev "}{...}
{vieweralsosee "sqstrmerge" "help sqstrmerge "}{...}
{vieweralsosee "sqtab" "help sqtab "}{...}
{viewerjumpto "Syntax" "sqstrlev##syntax"}{...}
{viewerjumpto "Description" "sqstrlev##description"}{...}
{viewerjumpto "Options" "sqstrlev##options"}{...}
{viewerjumpto "Remarks" "sqstrlev##remarks"}{...}
{viewerjumpto "Examples" "sqstrlev##examples"}{...}
{viewerjumpto "Saved Results" "sqstrlev##saved_results"}{...}
{viewerjumpto "Acknowledgements" "sqstrlev##acknowledgements"}{...}
{viewerjumpto "Author" "sqstrlev##author"}{...}
{viewerjumpto "References" "sqstrlev##references"}{...}
{...}
{title:Title}
{phang}
{cmd:sqstrmerge} {hline 2} Merge files using closest matches according to Levenshtein
{p_end}
{marker syntax}{...}
{title:Syntax}
{* put the syntax in what follows. Don't forget to use [ ] around optional items}{...}
{p 8 17 2}
{cmd: sqstrmerge}
{1:1|m:1|1:m}
{varlist}
{using}
{cmd:,} max(#)
[
{it:options}
]
{p_end}
{* the new Stata help format of putting detail before generality}{...}
{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:sqom-options}
{synopt:{opt indel:cost(#)}}set indel costs to {it:#}{p_end}
{synopt:{cmdab:sub:cost(}{it:#}|{it:implied formula}|{it:matexp}|{it:matname}{cmd:)}}specify substitution costs{p_end}
{synopt:{cmdab:st:andard(}{it:#}|{cmd:cut}|{cmd:longer}|{cmd:longest}|{cmd:none)}}standardization of sequences of different length{p_end}
{synopt:{opt k(#)}}restrict indels (to save calculation time){p_end}
{syntab:additional options}
{synopt:{opt ignore:case}}ignore case of letter{p_end}
{synopt:{opt asciiletters:only}}sets all non-ascii-letters to missing{p_end}
{synopt:{opt sound:ex}}caluculate distance on soundex transformed strings{p_end}
{syntab:merge options}
{synoptline}
{p2colreset}{...}
{marker description}{...}
{title:Description}
{pstd} {cmd:sqstrmerge} merges the master and the using data by
finding the closest match of the key variable in the using data. The
closest match is defined as the match that have the smalles Levenshtein
distance. The command merges close mathches only up to the Levenshtein
distance specified in the option max(){p_end} {p_end}
{marker options}{...}
{title:Options}
{phang} {opt max(#)} do not macht observations with a Levenshtein distance larger than the number specified. This option is requried. The default is {cmd:indelcost(1)}.
{phang} {opt indelcost(#)} specifies the cost attached to an
insertion or deletion of an alignment. The default is {cmd:indelcost(1)}.
{phang} {cmdab:sub:cost(}#|{it:implied
formula}|{it:matexp}|{it:matname}} specifies the cost attached to a
substitution in an alignment. Substitution costs may be specified as
real number, or as full matrix. Specifying
substitution cost as, for example, {cmd:subcost(3)} will attach the
cost of 3 to any substitution necessary in an alignment, regardless of
how similar the substituted values may be. The default is two times
the value specified as indel cost. A full substitution cost matrix
can be specified either by specifying the name of a matrix containing
the substitution cost or by typing valid matrix syntax into the option
itself. The matrix has to be a symmetric n*n matrix, where n is the
number of different elements in all sequences.
{phang}
{cmdab:st:andard(}#|{cmd:cut}|{cmd:longer}|{cmd:longest}|{cmd:none)}
is used to define the standardization of the resulting distances. With
{cmd:standard(#)} all sequences are cut to the length {it:#}. The keyword
{cmd:cut} automatically cuts all sequences to the length of the
shortest sequence in the dataset. {cmd:standard(longer)} divides all
distances by the length of the longer sequence of the respective
alignment. {cmd:standard(longest)} divides all distances by the length
of the longest sequence in the dataset; this is the
default. {cmd:none} is specified if no standardization is needed.
{phang}{opt k(#)} is used to speed up the calculation of the
optimal matching algorithm. Within the parentheses, an integer
positive number between 1 and the number of positions of the longest
sequence can be given. The speed up will be higher with small numbers.
Very small numbers can have the effect that the algorithm doesn't find
the best alignment between some sequences, and this problem tends to
increase if substitution costs are high relative to indel
costs. The option is ignored when option {cmd:sadi()} is specified.
{p 8 8 2}Note: The implementation of the {cmd:k()} is based partly on the
source code of TDA, written by Goetz Rohwer and Ulrich Poetter. TDA is
a very powerful program for transitory data analysis. It is programmed
in C and distributed as freeware under the terms of the General
Public License. It is downloadable from
{browse "http://www.stat.ruhr-uni-bochum.de/tda.html"}.
{phang}{opt ignorecase} ignore case of letter so that a=A, b=B ... z=Z, and vice versa. {p_end}
{phang}{opt asciiletters:only}} All non-ascii-letters are removed from the string if this option is specified. {p_end}
{synopt:{opt sound:ex}}calculate distance on soundex transformed
strings. See {help soundex()} for an explanation of the
soundex-transformation{p_end}
{marker examples}{...}
{title:Example(s)}{* Be sure to change Example(s) to either Example or Examples}
{phang}{cmd:. sqstrmerge 1:1 prename, max(2)}
{p_end}
{marker author}{...}
{title:Author}
{pstd}
Ulrich Kohler
email: {browse "mailto:ukohler@uni-potsdam.de":ukohler@uni-potsdam.de}
{p_end}