-------------------------------------------------------------------------------
Sequence analysis using Stata -- software demonstration
-------------------------------------------------------------------------------

Ulrich Kohler, WZB kohler@wzb.eu

Christian Brzinsky-Fay, WZB brzinsky-fay@wzb.eu

Magdalena Luniak, WZB

-------------------------------------------------------------------------------
Contents
-------------------------------------------------------------------------------

o Long versus wide
o sqset the data

The egen functions

o Generate variables with descriptive information on sequences
o Display contents of e-generated variables

Descriptive tables

o Tabulate sequences
o Describe sequences

Graph sequences

o Parallel-coordinates plot
o Sequence index plots

Optimal matching

o The sqom command
o Accessing the results
o On speed

-------------------------------------------------------------------------------
Long versus Wide
-------------------------------------------------------------------------------

Sequences are entities of their own; i.e., one thinks about sequences in "wide" form, and that is how datasets are usually structured.

. use http://www2000.wzb.eu/~kohler/ado/youthemp, clear

. list id st1-st10 in 1/10

Many Stata users prefer sequences in "long" form. The programs are therefore written for data in long form. Hence, to use the programs, one has to reshape from wide to long.

. reshape long st, i(id) j(order)

. list id st in 1/10
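
To make the two layouts concrete, the following sketch builds a tiny made-up dataset (two sequences of length three; the values are invented for illustration and are not taken from youthemp) and reshapes it with the same reshape call as above. Only built-in Stata commands are used.

```stata
* Toy data in wide form: one row per sequence (values are made up)
clear
input id st1 st2 st3
1 1 1 2
2 3 3 3
end

* Wide -> long: one row per sequence position
reshape long st, i(id) j(order)
list id order st, sepby(id)
```

After the reshape, each sequence occupies three rows, and the new variable order records the position within the sequence.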

-------------------------------------------------------------------------------
sqset the data
-------------------------------------------------------------------------------

To work with the SQ-Ados, one has to sqset the data. This command works similarly to tsset, stset, or xtset.


sqset elementvar idvar ordervar [, trim rtrim ltrim keeplongest]

sqset [, clear]

Example

. sqset st id order

Among other things, sqset checks for gaps, confirms that the order variable is integer-valued, and confirms the uniqueness of sequence IDs.
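
Checks of this kind can be approximated by hand with built-in commands. The following is only a sketch on made-up data, not sqset's actual implementation; in particular, it assumes that a "gap" means a missing element within a sequence.

```stata
* Toy data in long form (values are made up; sequence 2 has one gap)
clear
input id order st
1 1 1
1 2 1
1 3 2
2 1 3
2 2 .
2 3 3
end

* Each (id, order) pair must identify exactly one observation
isid id order

* The order variable must hold integers
assert order == int(order)

* Count missing elements (treated here as gaps)
count if missing(st)
```

isid and assert stop with an error if a check fails, so a do-file that runs through has passed both checks.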

-------------------------------------------------------------------------------
Generate variables with summary descriptions
-------------------------------------------------------------------------------

The SQ-egen functions are used to generate variables that hold a summary description of each sequence.

General usage

egen [type] newvar = sqfcn() [, options]

Examples

. egen seqlen = sqlength()               <- Overall length of sequence
. egen dur1 = sqlength(), element(1)     <- Overall length of sequence of element 1
. egen gaplen = sqgaplength()            <- Length of gaps
. egen gapcount = sqgapcount()           <- Number of episodes with gaps
. egen elemnum = sqelemcount()           <- Number of different elements in sequence
. egen chnum = sqepicount()              <- Number of episodes
. egen epi1 = sqepicount(), element(1)   <- Number of episodes of element 1

. describe

Stata keeps track of all variable names that are generated with the SQ-egen functions. Other SQ-commands automatically use the e-generated variables. The names of the e-generated variables are stored together with the dataset.
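
To illustrate what such summary variables contain, the sketch below computes a few comparable quantities by hand on made-up data, using built-in commands only. It is not how the SQ-egen functions are implemented, and the variable names (len, d1, nelem, nepi) are hypothetical.

```stata
* Toy data in long form (values are made up)
clear
input id order st
1 1 1
1 2 1
1 3 2
1 4 1
2 1 3
2 2 3
2 3 3
2 4 3
end

* Length of each sequence: number of non-missing positions
bysort id (order): egen len = count(st)

* Total time spent in element 1
bysort id (order): egen d1 = total(st == 1)

* Number of different elements: tag one observation per (id, st) pair
egen byte first = tag(id st)
bysort id (order): egen nelem = total(first)

* Number of episodes: a new episode starts whenever the element changes
bysort id (order): gen byte newepi = (_n == 1) | (st != st[_n-1])
bysort id (order): egen nepi = total(newepi)

list id order st len d1 nelem nepi, sepby(id)
```

For sequence 1 (1 1 2 1), this gives a length of 4, a duration of 3 in element 1, 2 different elements, and 3 episodes.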

-------------------------------------------------------------------------------
Display contents of e-generated variables
-------------------------------------------------------------------------------

The sqstat bundle provides convenient displays for the variables generated with the SQ-egen functions.

List features of sequences

. sqstatlist if sex
. sqstatlist dur1 elemnum chnum, ranks(1/10)

. preserve
. sqstatlist sex dur1, replace
. describe
. tab sex, sum(dur1)
. restore

Summarize features of sequences

. sqstatsum
. sqstatsum dur1 epi1 if sex

Tabulate features of sequences

. sqstattab1
. sqstattab1 dur1 gaplen

. sqstattab2 elemnum sex

. sqstattabsum sex
. sqstattabsum sex, sum(dur1)

-------------------------------------------------------------------------------
Tabulate sequences
-------------------------------------------------------------------------------

sqtab is used to produce a frequency table of the sequences in the dataset.


sqtab [varname] [if] [in] [, ranks(numlist) se so nosort gapinclude tabulate_options]


. sqtab
. sqtab, ranks(1/10)

"Same order" and "Same elements"

sqtab allows a simple definition of similarity of sequences. With the option so, all sequences that have the same order of elements are collapsed together. The option se collapses sequences that consist of the same elements.

. sqtab, so
. sqtab, se
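
The two notions of similarity can be illustrated by building a grouping key per sequence by hand. The sketch below uses made-up data and built-in commands only; the key variables so and se are hypothetical names chosen to mirror the options, and the code is not taken from sqtab.

```stata
* Toy data in long form (values are made up)
clear
input id order st
1 1 1
1 2 1
1 3 2
1 4 1
2 1 1
2 2 2
2 3 2
2 4 1
3 1 2
3 2 1
3 3 1
3 4 1
end

* "Same order" key: elements in order of appearance, one entry per episode
bysort id (order): gen so = string(st) if _n == 1
by id: replace so = so[_n-1] + cond(st != st[_n-1], "-" + string(st), "") if _n > 1
by id: replace so = so[_N]

* "Same elements" key: the distinct elements, regardless of their order
bysort id (st order): gen se = string(st) if _n == 1
by id: replace se = se[_n-1] + cond(st != st[_n-1], "-" + string(st), "") if _n > 1
by id: replace se = se[_N]

sort id order
list id order st so se, sepby(id)
```

Sequences 1 and 2 share the same-order key "1-2-1", whereas sequence 3 has the key "2-1". All three sequences share the same-elements key "1-2".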

-------------------------------------------------------------------------------
Describe sequences
-------------------------------------------------------------------------------

sqdes produces a descriptive overview of the sequences in the dataset. More specifically, it shows

o the number of elements observable over all sequences (k),

o the maximum length of the sequences (l),

o the number of possible sequences that might be formed with k elements of length l,

o the number of different sequences in the dataset, and

o the number of sequences that are shared by ... persons


sqdes [if] [in] [, so se graph gapinclude]


. sqdes
. sqdes, so
. sqdes, se graph
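
As a rough worked example of the first three quantities, the sketch below derives k and l from made-up data and then computes k^l. It assumes that the number of possible sequences is counted as k^l (each of the l positions can hold any of the k elements); the scalar names nelem and maxlen are hypothetical.

```stata
* Toy data in long form (values are made up)
clear
input id order st
1 1 1
1 2 1
1 3 2
2 1 3
2 2 3
2 3 3
end

* k: number of different elements observed over all sequences
quietly tabulate st
scalar nelem = r(r)

* l: maximum length of the sequences
bysort id: gen seqlen = _N
quietly summarize seqlen
scalar maxlen = r(max)

display "k = " scalar(nelem) ", l = " scalar(maxlen) ", k^l = " scalar(nelem)^scalar(maxlen)
```

With three elements and a maximum length of three, 27 sequences are possible under this assumption, of which the toy data contain two.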

-------------------------------------------------------------------------------
Graph sequences as parallel-coordinates plot
-------------------------------------------------------------------------------

sqparcoord produces parallel-coordinates plots of the sequences in the dataset. In its simplest form, such plots are useful only for very small numbers of sequences. Therefore, sqparcoord provides several options to produce meaningful displays even with larger numbers of sequences.


sqparcoord [if] [in] [, ranks(numlist) so offset(#) wlines(#) gapinclude twoway_options]


. sqparcoord                              <- All sequences (useless)
. sqparcoord, ranks(1/10) offset(.5)      <- 10 most frequent sequences, with offset
. sqparcoord, wlines(7)                   <- Plot frequent sequences much thicker

. sqparcoord, so ranks(1/10) offset(.5)   <- Using "same order" sequences
. sqparcoord, so wlines(7)                <- Plot frequent sequences much thicker

-------------------------------------------------------------------------------
Graph sequences as sequence index plot
-------------------------------------------------------------------------------

sqindexplot produces a sequence index plot (Brüderl and Scherer 2006). In these plots, the episodes of the sequences are plotted as stacked horizontal bars with colors to separate the different elements.

As stressed elsewhere, the results of sequence index plots depend on the order of the sequences in the graph. A simple algorithm is used to order the sequences in the plot, but results of more sophisticated algorithms can also be used (for example, results from sqom).


sqindexplot [if] [in] [, ranks(numlist) se so order(varname) by(varname) color(colorstyle) gapinclude twoway_options]


. sqindexplot, color(blue green black yellow red)
. sqindexplot, ranks(1/10)
. sqindexplot, so
. sqindexplot, se


With sequence index plots, one might overstate the frequency of elements on "high" levels. This can be minimized by (a) decent ordering and (b) tuning the aspect ratio.

-------------------------------------------------------------------------------
Perform optimal matching
-------------------------------------------------------------------------------

sqom performs optimal matching of sequences: it computes distances between sequences with the Needleman-Wunsch algorithm, and these distances can then serve as the basis for a cluster analysis. It allows free specification of "indel" and "substitution" costs, as well as different kinds of standardization. Results are stored for later use.


sqom [if] [in] [, indelcost(#) subcost(#|rawdistance|matexp|matname) name(varname) refseqid(spec) full k(#) standard(#|cut|longer|longest|none)]


. sqom                         <- Default: Indel = 1, subcost = 2
. sqom, indelcost(3)           <- Indel = 3, subcost = indelcost*2
. sqom, subcost(rawdistance)   <- Indel = 1, subcost = abs(value1-value2)

. matrix sub = 0,8,7,3,2\8,0,8,7,3\7,8,0,8,7\3,7,8,0,7\2,3,7,7,0
. sqom, subcost(sub)           <- subcosts from matrix "sub"

. sqom, standard(cut)          <- cut at length of shortest
. sqom, standard(6)            <- cut at length of 6
. sqom, standard(longer)       <- divide by the longer of two

. sqom, full k(2) <- full dissimilarity matrix
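
For readers who want to see how indel and substitution costs turn into a distance, here is a minimal Mata sketch of the Needleman-Wunsch recursion for a single pair of sequences with constant costs. The function omdist() is a hypothetical helper written for this illustration; it is not part of the SQ-Ados and does not reproduce sqom's options or standardizations.

```stata
mata:
// Distance between sequences a and b under constant indel and substitution costs
real scalar omdist(real rowvector a, real rowvector b, real scalar indel, real scalar subcost)
{
    real scalar i, j, n, m, csub, cdel, cins
    real matrix D

    n = cols(a)
    m = cols(b)
    // D[i+1, j+1] holds the cheapest way to turn a[1..i] into b[1..j]
    D = J(n + 1, m + 1, 0)
    for (i = 1; i <= n; i++) D[i + 1, 1] = i * indel
    for (j = 1; j <= m; j++) D[1, j + 1] = j * indel
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= m; j++) {
            csub = D[i, j] + subcost * (a[i] != b[j])
            cdel = D[i, j + 1] + indel
            cins = D[i + 1, j] + indel
            D[i + 1, j + 1] = min((csub, cdel, cins))
        }
    }
    return(D[n + 1, m + 1])
}
end

// Two made-up sequences that differ in one position; indel = 1, subcost = 2
mata: omdist((1, 1, 2, 1), (1, 2, 2, 1), 1, 2)
```

With these costs, the two sequences are at distance 2: one substitution, or equivalently one deletion plus one insertion.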

-------------------------------------------------------------------------------
Accessing results of optimal matching
-------------------------------------------------------------------------------

Results from sqom can be accessed for further analysis. Distances are saved either as a variable or as a Stata matrix named SQdist. The convenience programs sqclusterdat and sqmdsadd help to add the results of cluster analyses and/or multidimensional scaling to the sequence data.


. sqom, name(om1)
. describe om1
. sqindexplot, order(om1)

. sqom, full k(2)
. matrix dir
. sqclusterdat
. clustermat wardslinkage SQdist, name(myname) add
. cluster tree myname, cutnumber(20)
. sqclusterdat, return

. mdsmat SQdist
. predict mdsdim1, saving(mds)
. sqmdsadd using mds
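
The clustering step can be tried out without the SQ-Ados on a small, made-up distance matrix. The sketch below uses only official Stata commands; the matrix D, the cluster name toy, and the variable grp are hypothetical, and cluster generate is shown as one possible way to turn the tree into a grouping variable.

```stata
clear
* Made-up distances between three sequences
matrix D = (0, 2, 6 \ 2, 0, 4 \ 6, 4, 0)

* Ward's linkage on the distance matrix; clear creates a 3-observation dataset
clustermat wardslinkage D, name(toy) clear

* Cut the tree into two groups
cluster generate grp = groups(2), name(toy)
list
```

Sequences 1 and 2, which are closest to each other, end up in the same group, and sequence 3 forms a group of its own.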