```
help sqom                                                   (SJ6-4: st0111)
-------------------------------------------------------------------------------

Title

sqom -- Optimal matching of sequences

Syntax

sqom [if] [in] [, options]

options                                 Description
-------------------------------------------------------------------------
indelcost(#)                            set indel costs to #
subcost(#|implied formula|matexp|matname)
                                        specify substitution costs
name(varname)                           name of the variable in which the
                                          distances are stored
refseqid(spec)                          select reference sequence
full                                    calculate full dissimilarity
                                          matrix between sequences
k(#)                                    restrict indels (to save
                                          calculation time)
standard(#|cut|longer|longest|none)     standardization of sequences of
                                          different length
subsequence(a,b)                        use only subsequence between
                                          positions a and b
idealtype(spec)                         compare with a specified ideal
                                          typical sequence
-------------------------------------------------------------------------

Description

sqom performs optimal matching of sequences. The command uses the
Needleman-Wunsch algorithm to find the alignment between two sequences
that has the lowest Levenshtein distance. The Levenshtein distances are
then stored for further use.

By default, all sequences are compared with the most frequent sequence and
the resulting distances are stored in a variable. It is, however,
possible to compare all sequences with a preselected reference sequence
or to compare all sequences with every other sequence. In the latter
case, the resulting distances are stored in a Stata matrix.

Comparing all sequences with every other sequence is computationally
intensive.

Options

indelcost(#) specifies the cost attached to an insertion or deletion in
an alignment. The default is indelcost(1).

subcost(#|implied formula|matexp|matname) specifies the cost attached to
a substitution in an alignment. Substitution costs may be specified
as a real number, as an implied formula, or as a full matrix.
Specifying substitution costs as, for example, subcost(3) will attach
the cost of 3 to any substitution necessary in an alignment, regardless
of how similar the substituted values may be. The default is two times
the value specified as indel cost.  A full substitution cost matrix can
be specified either by giving the name of a matrix containing the
substitution costs or by typing valid matrix syntax into the option
itself.  The matrix has to be a symmetric n*n matrix, where n is the
number of different elements in all sequences.

implied formula generates substitution costs based on the data. The
implied formula is specified with a keyword. The following keywords
are allowed:

implied formula       Description
-------------------------------------------------------------------------
rawdistance           use the absolute value of the difference between
                        the numeric values representing the respective
                        elements
meanprobdistance      calculate a symmetric substitution cost matrix
                        based on the mean of the transition
                        probabilities (p) in the data between every two
                        neighboring elements in the sequences. The
                        substitution costs between elements x and y are
                        defined by: SC(x,y) = SC(y,x) = 2-p(x,y)-p(y,x)
                        if x is not equal to y, otherwise 0.
minprobdistance       calculate a symmetric substitution cost matrix
                        based on the transition probabilities (p) in
                        the data between every two neighboring elements
                        in the sequences. The substitution cost matrix
                        contains the minimal substitution costs for each
                        pair of symmetric transitions: SC(x,y) = SC(y,x)
                        = min(1-p(x,y),1-p(y,x))*2 if x is not equal to
                        y, otherwise 0.
maxprobdistance       calculate a symmetric substitution cost matrix
                        based on the transition probabilities (p) in
                        the data between every two neighboring elements
                        in the sequences. The substitution cost matrix
                        contains the maximal substitution costs for each
                        pair of symmetric transitions: SC(x,y) = SC(y,x)
                        = max(1-p(x,y),1-p(y,x))*2 if x is not equal to
                        y, otherwise 0.
-------------------------------------------------------------------------
The substitution costs in the last three cases have values between 0 and
2.

Specifying a full substitution cost matrix or generating a data-based
substitution cost matrix can increase the running time of the program
considerably.  Consider option k() when using sqom with a full
substitution cost matrix.

name(varname) is used to specify the name of the variable in which the
distances are stored. If not specified, _SQdist is used. The
automatically generated distance variable will get overwritten
without warning whenever a sqom command without name() is invoked.

refseqid(spec) selects the reference sequence with which all sequences
in the dataset are compared. Within the parentheses, an existing value
of the sequence identifier has to be stated.

full performs optimal matching of every sequence in the dataset against
every other sequence. The results of these comparisons are stored in the
distance matrix "SQdist". Specifying the option full will increase
the running time of the program considerably. Option k() might be
used for sqom with full.

Two companion programs, sqclusterdat and sqclustermat, help to
further analyze the distance matrix produced with sqom, full.

k(#) is used to speed up the calculation of the optimal matching
algorithm. Within the parentheses, a positive integer between 1 and
the number of positions of the longest sequence can be given.
The speedup is greater with small numbers.  Very small numbers
can have the effect that the algorithm doesn't find the best
alignment between some sequences, and this problem tends to increase
if substitution costs are high relative to indel costs.

Note: The implementation of k() is based partly on the source
code of TDA, written by Goetz Rohwer and Ulrich Poetter. TDA is a
very powerful program for transition data analysis. It is programmed
in C and distributed as freeware under the terms of the General
Public License; see http://www.stat.ruhr-uni-bochum.de/tda.html.

standard(#|cut|longer|longest|none) defines the standardization of the
resulting distances. With standard(#), all sequences are cut to
the length #.  standard(cut) automatically cuts all sequences to
the length of the shortest sequence in the dataset. standard(longer)
divides all distances by the length of the longer sequence of the
respective alignment. standard(longest) divides all distances by the
length of the longest sequence in the dataset; this is the default.
standard(none) is specified if no standardization is needed.

subsequence(a,b) includes only the part of each sequence that lies
between positions a and b, where a and b refer to the positions
defined in the order variable.

idealtype(spec) specifies an ideal typical sequence against which
all sequences are compared. To specify the sequence, use
element[:repetitions] [element[:repetitions] ...]. For example, with
idealtype(3:20 5 1:20 3:20) you specify an ideal typical sequence of
length 61. The ideal typical sequence starts with element 3 over 20
positions, followed by one position of element 5, 20 positions of
element 1, and finally again 20 positions of element 3.

Authors

Ulrich Kohler, WZB, kohler@wz-berlin.de
Magdalena Luniak, WZB, luniak@wz-berlin.de

Examples

. sqom, name(mydist)
. sqindexplot, order(mydist)

. sqom, full k(2)
. sqclustermat ward, name(mydist2)
. sqindexplot, order(mydist2)

Also see

Online: sq, sqdemo, sqset, sqdes, sqegen, sqstat, sqindexplot,
sqparcoord, sqom, sqclusterdat, sqclustermat
```
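
The Description and the indelcost(), subcost(), and standard() options describe an alignment cost built from insertions, deletions, and substitutions. The following Python sketch illustrates that kind of dynamic-programming recurrence. It is not the sqom implementation: the function names om_distance and rawdistance and the two example sequences are hypothetical, chosen only for illustration. The defaults mirror the documented ones (indel cost 1, substitution cost twice the indel cost).

```python
from typing import Callable, Optional, Sequence


def om_distance(
    a: Sequence[int],
    b: Sequence[int],
    indelcost: float = 1.0,
    subcost: Optional[Callable[[int, int], float]] = None,
) -> float:
    """Minimal alignment cost between sequences a and b (illustrative only)."""
    if subcost is None:
        # Documented default: a substitution costs twice the indel cost.
        cost = lambda x, y: 2.0 * indelcost
    else:
        cost = subcost

    # prev[j] is the cost of aligning the first i-1 elements of a
    # with the first j elements of b (row i-1 of the cost table).
    prev = [j * indelcost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indelcost]
        for j in range(1, len(b) + 1):
            x, y = a[i - 1], b[j - 1]
            step = 0.0 if x == y else cost(x, y)
            curr.append(min(
                prev[j] + indelcost,      # delete x from a
                curr[j - 1] + indelcost,  # insert y into a
                prev[j - 1] + step,       # match or substitute
            ))
        prev = curr
    return prev[-1]


def rawdistance(x: int, y: int) -> float:
    """Absolute difference of the numeric codes (cf. the rawdistance keyword)."""
    return float(abs(x - y))


# Hypothetical example sequences of unequal length.
seq_a = [1, 1, 2, 3]
seq_b = [1, 2, 2, 3, 3]

d_default = om_distance(seq_a, seq_b)
d_raw = om_distance(seq_a, seq_b, subcost=rawdistance)
# Division by the longer of the two sequences, in the spirit of standard(longer).
d_longer = d_default / max(len(seq_a), len(seq_b))
```

Passing the substitution cost as a function keeps the sketch short; a full n*n cost matrix, as accepted by subcost(), could be wrapped in such a function by looking up the pair (x, y).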