help for split

Splitting string variables into parts

split strvar [if exp] [in range] [, generate(stub) parse(parse_strings) limit(#) [no]trim destring destring_options ]


split splits the contents of a string variable strvar into one or more parts, using one or more parse_strings (by default blank space(s)), so that new string variables are generated. It is thus useful for separating `words' or parts of a string variable. strvar itself is not modified.

Options

generate(stub) specifies the beginning characters of the new variable names, so that new variables stub1, stub2, etc., are produced. stub defaults to strvar.

parse(parse_strings) specifies that, instead of spaces, parsing should be done using one or more parse_strings. Most commonly, one string which is a single punctuation character will be specified. For example, if parse(,) is specified, then "1,2,3" is split into "1", "2" and "3".

It is also possible to specify (1) two or more strings which are alternative separators of `words' and/or (2) strings which consist of two or more characters. Alternative strings should be separated by spaces and strings which include spaces should be bound by " ". Thus if parse(, " ") is specified, then "1,2 3" is also split into "1", "2" and "3". Note particularly the difference between (say) parse(a b) and parse(ab): with the first, "a" and "b" are both acceptable as separators, while with the second, only the string "ab" is acceptable.
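The difference between alternative separators and a multi-character separator can be checked on a tiny dataset. The following is a minimal sketch; the variable names s1 and s2 and the stubs p, alt and whole are made up for illustration.

```stata
* Hypothetical one-observation dataset; all names here are illustrative
clear
input str8 s1 str8 s2
"1,2 3" "1a2b3"
end

* Comma and space are alternative separators: expect "1", "2", "3"
split s1, parse(, " ") generate(p)

* "a" and "b" are alternative separators: expect "1", "2", "3"
split s2, parse(a b) generate(alt)

* Only the two-character string "ab" separates; "1a2b3" contains no "ab",
* so a single new variable holding the whole string is expected
split s2, parse(ab) generate(whole)

list
```

Using a distinct stub for each call keeps the three sets of new variables from colliding.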

limit(#) specifies an upper limit to the number of new variables to be created. Thus limit(2) specifies that at most two new variables should be created.
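As a rough illustration of limit(), the sketch below splits a three-word string but asks for at most two new variables. The variable name and stub are hypothetical. What happens to text beyond the second part is not stated above, so it is worth listing the result rather than assuming.

```stata
* Hypothetical data: one three-word string
clear
input str20 name
"Mary Ann Smith"
end

* Ask for at most two new variables, w1 and w2
split name, limit(2) generate(w)

* Inspect which variables were created and what they hold
describe w*
list name w*
```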

notrim specifies that the original string variable should not be trimmed of leading and trailing spaces before being parsed. trim is the default.

destring applies destring to the new string variables, replacing the variables initially created as string by numeric variables where possible.

destring_options qualify the application of destring. Possible options are float, force, ignore() and percent. For details, see destring.
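The phrase "where possible" can be seen with mixed content. In this sketch, which uses made-up data and names, the first and third parts are numeric throughout and should be converted, while the second part contains letters and should stay as a string variable.

```stata
* Hypothetical data: number, word, number in each observation
clear
input str12 raw
"12 abc 3.5"
"7 def 0.25"
end

* Split on spaces (the default) and convert new variables where possible
split raw, generate(v) destring

* v1 and v3 should now be numeric; v2 should remain string
describe v*
list
```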

Examples

1. Suppose that input is somehow misread as one string variable, say when you copy and paste into the data editor, but data are space-separated:

. split var1, destring

2. Suppose a string variable holds names of legal cases which should be split into variables for plaintiff and defendant. The separators could be " V ", " V. ", " VS " and " VS. ". Note particularly the leading and trailing spaces: a bare "V", for example, would incorrectly split "GOLIATH V DAVID" at the "V" inside "DAVID" as well.

. split case, p(" V " " V. " " VS " " VS. ")

Signs of problems would be the creation of more than two variables and any variable having blank values, so check:

. list case2 if case2 == ""
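Both warning signs can be checked in a few lines. This is only a sketch on invented case names; the third observation has no separator at all, so its second part should come out blank.

```stata
* Hypothetical case names; the last one has no recognisable separator
clear
input str30 case
"GOLIATH V DAVID"
"SMITH VS. JONES"
"BROWN"
end

split case, p(" V " " V. " " VS " " VS. ")

* Sign 1: a third variable exists, so some case split into more than two parts
capture confirm variable case3
if _rc == 0 {
    list case if case3 != ""
}

* Sign 2: blank second part, so no separator was found
count if case2 == ""
list case if case2 == ""
```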

3. Suppose a string variable holds time of day in the form "hh:mm:ss", e.g. "12:34:56".

. split hms, p(:) destring
. gen timeofday = hms1 + hms2 / 60 + hms3 / 3600

Or suppose a string variable holds time of day in the form "hh:mm:ss am" or "hh:mm:ss pm", e.g. "06:54:32 am", "11:22:33 pm".

. split hms, p(: " ") destring
. gen timeofday = hms1 + hms2 / 60 + hms3 / 3600 + 12 * (hms4 == "pm")
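Times in the 12 o'clock hour need extra care with this arithmetic: "12:00:00 pm" would come out as 24 and "12:30:00 am" as 12.5. One possible adjustment, assuming the usual 12-hour convention in which 12 am is midnight and 12 pm is noon, is to take the hour modulo 12 first. The data below are invented for the sketch.

```stata
* Hypothetical times, including the 12 o'clock edge cases
clear
input str11 hms
"06:54:32 am"
"11:22:33 pm"
"12:00:00 pm"
"12:30:00 am"
end

* hms1-hms3 become numeric; hms4 ("am"/"pm") cannot be converted and stays string
split hms, p(: " ") destring

* mod(hms1, 12) maps hour 12 to 0, so 12 am gives 0 and 12 pm gives 12
gen timeofday = mod(hms1, 12) + hms2 / 60 + hms3 / 3600 + 12 * (hms4 == "pm")

list hms timeofday
```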

4. Email addresses split at "@":

. split address, p(@)
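After such a split the two parts will often be given more meaningful names. A small sketch with made-up addresses follows; the names user and domain are illustrative choices, not anything split produces itself.

```stata
* Hypothetical addresses
clear
input str30 address
"alice@example.com"
"bob@mail.example.org"
end

* Creates address1 (before "@") and address2 (after "@")
split address, p(@)

* Give the parts descriptive names
rename address1 user
rename address2 domain

list
```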


Nicholas J. Cox, University of Durham, U.K.
n.j.cox@durham.ac.uk


This program has benefitted substantially from the work of Michael Blasnik on an earlier jointly written program. Ken Higbee made very useful comments.

Also see

On-line: help for destring, egen (ends())
Manual: [R] destring, [R] egen