Tabulate string variables split into parts
tabsplit strvar [if exp] [in range] [ , characters parse(parse_strings) [no]trim tabulate_options ]
Description
tabsplit tabulates frequencies of occurrence of the parts of a string variable. By default, the parts of a string are separated by spaces. The parts of "A B C" are thus "A", "B" and "C". Optionally, alternative parsing strings may be specified. The parts of "A,B,C" with parse(,) are, again, "A", "B" and "C". The parts of "A B C" with parse(,) are just the single part "A B C". The idea of a part thus generalises Stata's concept of a word.
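The splitting rule described above can be sketched outside Stata. The following Python fragment is an illustration only, not tabsplit's own code; the helper name split_parts is hypothetical, and dropping empty pieces is an assumption.

```python
def split_parts(value, sep=" "):
    """Split value on sep and drop empty pieces (hypothetical helper)."""
    return [part for part in value.split(sep) if part != ""]

# Default: parts are separated by spaces.
print(split_parts("A B C"))
# With a comma as the parse string, commas separate the parts.
print(split_parts("A,B,C", sep=","))
# With a comma as the parse string, a string without commas is one part.
print(split_parts("A B C", sep=","))
```

The third call shows why a part is more general than a word: the embedded spaces survive because only the comma counts as a separator.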
Remarks
Suppose data are gathered on modes of transport used in the journey to work. In addition to single values such as "car", "cycle" and "foot", there may be multiple values such as "car train tube foot" for people who use two or more modes. Within the limits on string variables in your version of Stata, such single or multiple values may be stored in a string variable. You may then want, for example, to count the individual modes used. tabsplit is designed for this special problem.
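As a rough picture of the counting involved, here is a minimal Python sketch. The list journeys is invented purely for illustration, and the code only mimics the idea of counting space-separated parts; it is not how tabsplit is implemented.

```python
from collections import Counter

# Invented example values: one string per respondent.
journeys = ["car", "cycle", "foot", "car train tube foot", "train foot"]

mode_counts = Counter()
for value in journeys:
    # str.split() with no argument splits on runs of spaces
    # and ignores leading and trailing spaces.
    mode_counts.update(value.split())

# Print modes from most to least frequent.
for mode, freq in mode_counts.most_common():
    print(f"{mode:<8}{freq:>4}")
```

Each respondent can contribute several parts, so the frequencies sum to the number of parts, not the number of observations.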
By default, leading and trailing spaces are ignored. Thus, string values that equal one or more spaces are treated just as if they were missing. Likewise, with " 1, 2, 3" and parse(,), the parts are "1", "2" and "3".
Options
characters specifies that strings are to be split into separate characters. Thus strings such as "ABCDE" and "ABC" will be split so that the frequencies of "A", "B", etc. will be tabulated. parse() is ignored if characters is specified.
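The effect of characters can be pictured with a short Python sketch. It is an illustration under assumptions, not tabsplit's code: the two values are the ones quoted above, and stripping leading and trailing spaces first is assumed to mirror the default trimming.

```python
from collections import Counter

values = ["ABCDE", "ABC"]

char_counts = Counter()
for value in values:
    # Updating a Counter with a string counts each character separately.
    char_counts.update(value.strip(" "))

for char, freq in sorted(char_counts.items()):
    print(char, freq)
```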
parse(parse_strings) specifies that, instead of spaces, parsing should be done using one or more parse_strings. Most commonly, one string which is a single punctuation character will be specified. For example, if parse(,) is specified, then "1,2,3" is split into "1", "2" and "3".
It is also possible to specify (1) two or more strings which are alternative separators of parts and/or (2) strings which consist of two or more characters. Alternative strings should be separated by spaces, and strings which include spaces should be enclosed in double quotes (" "). Thus if parse(, " ") is specified, then "1,2 3" is also split into "1", "2" and "3". Note particularly the difference between (say) parse(a b) and parse(ab): with the first, "a" and "b" are both acceptable as separators, while with the second, only the string "ab" is acceptable.
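The difference between alternative separators and a single multi-character separator can be sketched in Python. The helper split_on_any is hypothetical and is not tabsplit's implementation; dropping empty pieces and trying longer separators first are assumptions made for the illustration.

```python
import re

def split_on_any(value, separators):
    """Split value wherever any of the separator strings occurs."""
    # Try longer separators first so that a longer string is not
    # pre-empted by a shorter one that happens to be its prefix.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return [part for part in re.split(pattern, value) if part != ""]

# Like parse(, " "): a comma or a space separates parts.
print(split_on_any("1,2 3", [",", " "]))
# Like parse(a b): "a" and "b" are both separators.
print(split_on_any("xabyaz", ["a", "b"]))
# Like parse(ab): only the two-character string "ab" is a separator.
print(split_on_any("xabyaz", ["ab"]))
```

With the two separators "a" and "b", the value "xabyaz" yields three parts; with the single separator "ab", it yields two, because the lone "a" is kept inside the second part.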
notrim specifies that the original string variable should not be trimmed of leading and trailing spaces before being parsed, and that the parts should likewise not be trimmed before being tabulated. notrim is not considered compatible with parsing on spaces, as the latter implies that spaces in a string are to be discarded: either specify parse strings with parse() or accept the default trimming.
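The contrast between the default trimming and notrim can be sketched in Python. This is an illustration under assumptions rather than tabsplit's code: split_parts and its trim argument are hypothetical names, and empty pieces are assumed to be dropped.

```python
def split_parts(value, sep, trim=True):
    """Split value on sep, optionally trimming the value and each part."""
    if trim:
        value = value.strip(" ")          # trim the whole string first
    pieces = value.split(sep)
    if trim:
        pieces = [piece.strip(" ") for piece in pieces]  # then each part
    return [piece for piece in pieces if piece != ""]

# Default behaviour: spaces around the parts are ignored.
print(split_parts(" 1, 2, 3", ","))
# Analogue of notrim: the spaces stay attached to the parts.
print(split_parts(" 1, 2, 3", ",", trim=False))
# A value consisting only of spaces yields no parts when trimmed.
print(split_parts("   ", ","))
```

Without trimming, " 1" and "1" would be tabulated as different categories, which is why trimming is the default.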
tabulate_options are options of tabulate with one variable. The most useful in practice is sort. Note that the table is based on a temporary dataset which does not remain in memory after tabsplit has finished.
Examples
. tabsplit authors, parse(,) sort
Author
Nicholas J. Cox, University of Durham, U.K. n.j.cox@durham.ac.uk
Also see
On-line: help for split, tabulate