help sq (SJ6-4: st0111) -------------------------------------------------------------------------------

Title

sq -- Sequence analysis

Description

The term sq refers to sequence data and to the commands for analyzing these data. Sequences are entities built from a limited number of elements that are ordered in a specific way. A typical example is human DNA, where the elements adenine, cytosine, guanine, and thymine (the organic bases) are ordered into a sequence. Other examples are songs, which are built from tones that appear in a specific order, or careers of employees, which are built from specific job positions ordered along time.

Sequence data are data that contain one variable holding the elements, one variable holding the position of each element within its sequence, and one variable that identifies the sequence itself. Hence, sequence data must be set up in what Stata usually calls the "long form", which is explained in some detail below.

The SQ-Ados are a bundle of commands to describe, analyze, and group the sequences of a sequence dataset. The following sq commands are available:

    sqset           Declare data to be sequence data
    sqdes           Describe sequence concentration
    sqtab           Tabulate sequences
    sqegen          Generate variables reflecting entire sequences
    sqstat          Describe, summarize, and tabulate sq-egenerated variables' sequences
    sqindexplot     Graph sequences as sequence index plot
    sqparcoord      Graph sequences with parallel coordinates
    sqom            Optimal matching of sequences
    sqclusterdat    Prepare a dataset to perform cluster analyses on the results of sqom
    sqclustermat    Perform one cluster analysis on the results of sqom

You begin an analysis by sqsetting your data, which tells Stata the key sequence data variables; see sqset. Once you have sqset your data, you can use the other sq commands. If you save your data after sqsetting it, you will not have to sqset it again in the future; Stata will remember it.

Please refer to sqdemo for a quick demonstration of the sq commands.

Remarks

Remarks are presented under the following headings:

    Sequences and how they are stored
    Typical research questions
    Gaps, missings, etc.
    Limitations

Sequences and how they are stored

An example for a sequence is the following chain of letters:

A-G-C-T-T-T-T-G-C-A

In this example, the letters might stand for something else, such as the four organic bases adenine, guanine, cytosine, and thymine of DNA. The chain of letters might, however, also denote the tones of a song (using the letter "T" for a break), the employment states in a job career, party preferences during a lifetime, etc.

In what follows, we will use the term "sequence" for the entire chain, "element" for the states of the chain, and "position" for the place at which a specific element is found. Hence, in the sequence above, the element A is at positions 1 and 10, G is at positions 2 and 8, etc.

In Stata, sequences can be stored in two formats. The first format is the wide form. Sequence data in wide form store sequences underneath each other with one variable for each position. Here is an example:

(wide form)

    id   bas1   bas2   bas3   bas4
    -------------------------------
     1      A      G      C      T
     2      G      C      T      A
     3      C      G      T      A

The second format has one variable that indicates the sequence, one variable that stores the position and another one that stores the elements. This is called the long form. In long form, the above example looks as follows:

(long form)

    id   pos   bas
    ---------------
     1     1     A
     1     2     G
     1     3     C
     1     4     T
     2     1     G
     2     2     C
     2     3     T
     2     4     A
     3     1     C
     3     2     G
     3     3     T
     3     4     A

The sq-commands expect sequence data in long form. Toggling between wide and long form is easy with reshape.
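As an illustration, the following is a minimal sketch of the conversion, assuming the toy data of the wide-form example with string variables bas1 to bas4. The data are typed in only to make the example self-contained; with real data you would start from the reshape line.

```stata
* Toy data in wide form (made up for illustration)
clear
input byte id str1 bas1 str1 bas2 str1 bas3 str1 bas4
1 "A" "G" "C" "T"
2 "G" "C" "T" "A"
3 "C" "G" "T" "A"
end

* Wide to long: bas1-bas4 become one variable bas, indexed by pos
reshape long bas, i(id) j(pos)
list, sepby(id)

* Long back to wide, in case the wide form is needed again
reshape wide bas, i(id) j(pos)
```

Here i() names the variable that identifies the sequences and j() names the variable that will hold the positions; see [D] reshape for details.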

Typical research questions

The first aim of sequence analysis is to describe the sequences. With a few short sequences, it is easy to describe the sequences by simply listing them, but in practice, there are usually many sequences that tend to be rather long. It is therefore necessary to have specific tools that allow describing many long sequences effectively. Among the sq-commands, sqegen, sqstat, sqtab, sqparcoord, and sqindexplot might be useful for this task.

The second aim of sequence analysis is to find certain similarities of sequences. The similarity of sequences has to be defined a little further, however. Look, for example, at the following sequences, presented in wide form:

    id   bas1   bas2   bas3   bas4
    -------------------------------
     1      A      G      C      T
     2      G      C      T
     3      A      G

First, the three sequences have different lengths. In terms of length, sequence 1 is more similar to sequence 2 than to sequence 3. If, however, one compares the elements at each position, sequences 1 and 3 have the same elements at the first two positions, and they differ only in that sequence 1 is longer than sequence 3. Sequence 2 has different elements at each position from those of the two other sequences. Hence, sequences 1 and 3 are more similar than 1 and 2 in this respect. Finally, in a third respect, sequences 1 and 2 are quite similar. They differ only in that sequence 1 starts with "A". If we delete the first position from sequence 1, or insert "A" at the beginning of sequence 2, the two sequences become identical. All tools for describing sequences can also be used to find similarities between sequences in one respect or another. However, one of the sq-commands, sqom, is specifically designed to find similarities in the third respect.

Finally, once one has been able to identify certain typical sequences, one might be interested in using sequence types as independent variables in statistical models. Biostatisticians might be interested in whether specific types of DNA sequences affect behavior or appearance of species, and social scientists might be interested in whether certain types of educational careers cause dangerous job situations. The sq-commands therefore allow building variables for grouping similar sequences together.

Gaps, missings, etc.

If an element at a certain position is unknown, we call this a gap. Unknown elements can theoretically appear at the beginning, at the end, or in the middle of a sequence, and we treat these cases differently.

For sequence analysis, gaps create several problems -- not so much technical problems as problems of content. The way one deals with gaps influences the substantive results of sequence analysis, and which way of dealing with gaps is most appropriate depends on the research question. The SQ-Ados are generally designed such that sequences that contain a gap in the middle are not used in the analysis; however, they can be included in some of the programs by using the option gapinclude.

Unknown elements at the beginning or the end of a sequence are generally not counted as a "gap". We do, however, recommend erasing them with the options ltrim, rtrim, or trim of the command sqset.

Taking care of gaps is mainly up to the user. To guide users through these decisions, sqset checks for gaps and proposes ways to deal with them. In this section, we will explain the various ways to deal with gaps in more detail.

In sequence data, gaps can appear in two ways. Presented in long form, the first way is shown here:

    id   pos   bas
    ---------------
     1     1     A
     1     2
     1     3     C
     1     4     T

There is no entry (or a missing value) at position 2. In other words, one does not know the element at position 2. The other way to represent gaps is to erase the observation from the data:

    id   pos   bas
    ---------------
     1     1     A
     1     3     C
     1     4     T

Although both ways seem to represent the same information, we let you sqset the data only if gaps are represented in the first way. With sqset, an error message will appear if gaps appear in the second form. To proceed, you need to restructure your data such that gaps either appear the first way or do not appear at all. Hence, you might go on with

. fillin id pos

which will bring you to the first way (as long as there is at least one sequence without a gap at the second position), or you restructure the variable holding the positions of the elements by stating something like

. bysort id (pos): replace pos = _n
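Before choosing between the two repairs, it can help to see where observations are missing. The following is a minimal sketch on made-up toy data; the variable name jump is hypothetical, and the check is only an illustration, not the test that sqset itself performs.

```stata
* Toy data: sequence 1 has no observation for position 2
clear
input byte id byte pos str1 bas
1 1 "A"
1 3 "C"
1 4 "T"
2 1 "G"
2 2 "C"
2 3 "T"
2 4 "A"
end

* Flag observations whose position does not directly follow
* the previous position within the same sequence
bysort id (pos): generate byte jump = pos != pos[_n-1] + 1 if _n > 1
list if jump == 1

* Add the missing observations, so that the gap becomes an empty element
fillin id pos
sort id pos
list, sepby(id)
```

After fillin, the added observations are marked by the variable _fillin, and their element variable is empty.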

After having decided how to deal with the "forbidden" gaps, one can sqset the data. However, if there are still gaps of the first variety in the data, sqset will display a note accordingly. You then have several choices. The first choice is always to simply ignore the note and let the sq-commands apply their own treatment of gaps. The second choice is to recode missing values to a meaningful value; that is, you define the missing to be just another element:

. replace bas = "M" if bas == ""

The third choice is to keep only the longest available section of each sequence that is not interrupted by gaps. This can be achieved with the option keeplongest of sqset.

Limitations

For the SQ-Ados, sequence data are expected to be in long format, which imposes no restrictions with respect to sequence length. Much of the programming within the SQ-Ados is, however, done in wide format, so that the maximum sequence length is somewhat less than the number of variables allowed in the respective flavor of Stata (32,767 in Stata/SE and 2,047 in Intercooled Stata).

The command sqom with option full stores its results by pushing a Mata matrix into a Stata matrix. The maximum dimension of the Stata matrix is 11,000 x 11,000. The flavor of Stata and the matsize setting play no role for this restriction.

Given the limits and speed problems, optimal matching as it is implemented in sqom seems capable of working with a moderate number of relatively short sequences. It has been tested for around 2,000 sequences of length up to 100 positions.
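To see how a dataset compares with these limits, one can count the sequences and find the longest one. The following is a minimal sketch, assuming long-form data in memory with a sequence identifier named id; the variable names seqlen and first are hypothetical.

```stata
* Number of observations (positions) per sequence
bysort id: generate long seqlen = _N

* Mark one observation per sequence and count the sequences
egen byte first = tag(id)
quietly count if first
display "number of sequences: " r(N)

* Longest sequence compared with the current variable limit
quietly summarize seqlen
display "longest sequence:    " r(max) " positions"
display "variable limit:      " c(maxvar)
```

Note that seqlen counts observations, so it equals the sequence length only if gaps are stored as empty entries rather than as erased observations.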

Authors

    Ulrich Kohler, WZB, kohler@wz-berlin.de
    Magdalena Luniak, WZB, luniak@wz-berlin.de
    Christian Brzinsky-Fay, WZB, brzinsky-fay@wz-berlin.de

Bug reports go to Ulrich Kohler. Questions on applications of sequence analysis are handled by Christian Brzinsky-Fay.

Also see

Manual: [D] reshape

Online: sq, sqdemo, sqset, sqdes, sqegen, sqstat, sqindexplot, sqparcoord, sqom, sqclusterdat, sqclustermat