Decode numeric variable into string using formats for unlabelled values
sdecode varname [if] [in] , [ generate(newvar) | replace ] [ maxlength(#) format(format_spec) labonly missing ftrim xmlsub esub(esubstitution_rule [, elzero]) prefix(string) suffix(string) ]
msdecode varlist [if] [in] , generate(newvar) [ replace delimiters(string_list) maxlength(#) format(format_spec) labonly missing ftrim xmlsub esub(esubstitution_rule [, elzero]) prefix(string) suffix(string) ]
where format_spec is either a format or a string variable name, and esubstitution_rule is any one of
none | x10 | rtfsuper | texsuper | htmlsuper | smclsuper
Description
sdecode ("super decode") creates an output string variable with values from the input numeric variable varname, using labels if present and formats otherwise. The output string variable may either replace the input numeric variable or be generated as a new variable named newvar. Unlike decode, sdecode creates an output string variable containing the values of the input variable as output by the tabulate command and other Stata output, instead of decoding all unlabelled input values to missing. sdecode is especially useful if a numeric variable has value labels for some values but not for others. msdecode is a multivariate version of sdecode, which inputs a list of numeric variables and (optionally) a list of delimiters, and creates a single string variable, containing the concatenated and decoded values of all the input variables, separated by the delimiters if provided.
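The contrast with decode can be sketched on the auto data shipped with Stata. This is a minimal illustration, assuming sdecode has been installed from SSC; the label name replab and the output variable names are hypothetical choices made for this example.

```stata
* Assumes the sdecode package is installed (for example via: ssc install sdecode)
sysuse auto, clear
* rep78 carries no value labels in auto.dta; label only two of its values
label define replab 1 "Poor" 5 "Excellent"
label values rep78 replab
* Official decode: unlabelled values become missing strings
decode rep78, generate(rep_dec)
* sdecode: labels where present, the variable's format otherwise
sdecode rep78, generate(rep_sdec)
list rep78 rep_dec rep_sdec in 1/10, nolabel
```

In this sketch, rep_dec should be empty wherever rep78 is 2, 3 or 4, while rep_sdec should hold the formatted number for those values.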
Options
For sdecode, either generate() or replace must be specified, but both options may not be specified at the same time. For msdecode, generate() must be specified, but replace is optional.
generate(newvar) specifies the name of a new output string variable to be created.
replace, with sdecode, specifies that the output string variable will replace the input numeric variable, and have the same name, the same position in the data set, and the same variable label and characteristics if present. With msdecode, replace specifies that any existing variable with the same name as the generate() variable will be replaced.
delimiters(string_list) (msdecode only) specifies a list of delimiters, to be inserted between the decoded values of successive variables in the input varlist when the output variable is generated. If the number of elements provided is less than the number of input variables minus 1, then the last element is repeated as often as necessary. If the delimiters() option is not provided, then the empty string "" is assumed, and repeated as often as necessary.
maxlength(#) is optional. It specifies how many characters of the value label to retain. # must be an integer between 1 and the maximum string variable length, which is stored in the system parameter c(maxstrvarlen). If unset, then maxlength() is set to the maximum string variable length.
format(format_spec) is optional. It specifies the format (or formats) used for decoding unlabelled values of the input numeric variable. It may be either a format (to be used for all unlabelled values), or the name of a string format variable (in which case each observation with an unlabelled value is decoded using the format stored in the string format variable for that observation). If format() is not specified, then sdecode and msdecode use the format associated with the input numeric variable.
labonly is optional. It specifies that only labelled values for the input numeric variable will be decoded to nonmissing string values in the output string variable, and that unlabelled values will be decoded to a missing string value, as with decode. If labonly is not specified, then all nonmissing values of the input numeric variable will be decoded to nonmissing string values, except for values in observations excluded by the if and in qualifiers, which are decoded to a missing string value.
missing is optional. It specifies that missing values in the input numeric variable will be decoded (using formats) to nonmissing formatted string values (such as "."). If missing is not specified, then missing values in the input numeric variable are decoded to missing string values.
ftrim is optional. It specifies that values of the output string variable produced using a format will be trimmed to remove spaces on the left and on the right.
xmlsub is optional. It specifies that, in the decoded string output variable, the substrings "&", "<" and ">" will be replaced throughout with the XML entity references "&amp;", "&lt;" and "&gt;", respectively. This is useful if the decoded string output variable is intended for output to a table in a document in XHTML, or in other XML-based languages. This substitution, if specified, is performed before any substitution specified by the esub() option.
esub(esubstitution_rule [, elzero]) is optional. It specifies a rule for substitution of exponents in decoded values produced using the format specified by the format() option, to make them more suitable for output to TeX, HTML, RTF, or other word processor documents. The presence of exponents is normally indicated, in Stata formatted values, by the presence of substrings "e-" or "e+". These substrings may indicate that the substring to the left is a mantissa, and that the substring to the right is the absolute value of an exponent, conventionally presented in documents as a superscript. The possible values of the esubstitution_rule are none, x10, rtfsuper, texsuper, htmlsuper and smclsuper. These rules are documented below under Substitution rules for the esub() option. The suboption elzero, if present, indicates that, if the exponent contains leading zeros, then those leading zeros will be retained in the final formatted value. If the esub() option is specified without the elzero suboption, then such leading zeros are removed.
prefix(string) is optional. It specifies a prefix string, to be added to the left of the generated string variable.
suffix(string) is optional. It specifies a suffix string, to be added to the right of the generated string variable.
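Two of the options above may be easier to follow with a small sketch: a string format variable passed to format(), and a delimiters() list shorter than the number of gaps between variables. This assumes the auto data and an installed sdecode package; the variable names pricefmt, sprice and fwp, and the price threshold, are hypothetical.

```stata
* Assumes the sdecode package is installed (for example via: ssc install sdecode)
sysuse auto, clear
* Hypothetical format variable: two decimals for cheaper cars, none otherwise
generate str8 pricefmt = cond(price < 5000, "%9.2f", "%9.0f")
* Each observation is decoded using the format stored in pricefmt
sdecode price, generate(sprice) format(pricefmt) ftrim
* One delimiter for three input variables: it is repeated between each pair
msdecode foreign weight price, generate(fwp) delimiters(" | ")
list foreign weight price sprice fwp in 1/5
```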
Substitution rules for the esub() option
If the user specifies an esub() option, then sdecode and msdecode perform exponent substitution on those values of the output string variable which were produced using the format specified by the format() option. This is done after any trimming specified by the ftrim option and/or any XML entity substitution specified by the xmlsub option, and before any addition of prefixes and suffixes specified by the prefix() and suffix() options.
The first step is to locate the first appearance, in the output string value, of the substring "e-" or the substring "e+", whichever appears first. This substring (if it exists) is known as the <esign>. The substring to the left is known as the <mantissa>, and the substring to the right is known as the <exponent>. An output string value therefore has the syntax
<mantissa> | <mantissa><esign><exponent>
where <mantissa> is a string without any embedded "e-" or "e+" substrings. If elzero is not specified, then the next step is to attempt to remove any leading zeros from the <exponent>, using a method that works if the <exponent> is an unsigned integer.
If an <esign> is present, then the next step is to replace the <esign> with an infix string <eminfix> if the <esign> is "e-", or with an infix string <epinfix> if the <esign> is "e+", and to append a string <esuffix> to the end of the <exponent>. The esubstitution_rule is defined by the values of the <eminfix>, <epinfix> and <esuffix> strings. The revised output string should then have the syntax
<mantissa> | <mantissa><eminfix><exponent><esuffix> | <mantissa><epinfix><exponent><esuffix>
The values for the different esubstitution_rules are as follows:
-------------------------------------------------------------------------------
esubstitution_rule  <eminfix>       <epinfix>      <esuffix>  Description
-------------------------------------------------------------------------------
none                "e-"            "e+"           ""         No substitution
x10                 "x10-"          "x10"          ""         To be superscripted manually
rtfsuper            "x10{\super -"  "x10{\super "  "}"        RTF superscript
texsuper            "\times 10^{-"  "\times 10^{"  "}"        TeX superscript
htmlsuper           "x10<sup>-"     "x10<sup>"     "</sup>"   HTML superscript
smclsuper           "x10{sup:-"     "x10{sup:"     "}"        SMCL superscript
-------------------------------------------------------------------------------
Note that, if the user specifies esub(none,elzero), then the result is equivalent to specifying no esub() option. SMCL superscripts are documented in the online help for Stata graphics text.
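The substitution steps can be imitated with official Stata string functions. The following is only a sketch of the rules as described in this section, applied to three hand-typed values with the htmlsuper strings from the table; it is not the package's own code, and all variable names are hypothetical.

```stata
clear
input str12 s
"1.5e-03"
"2.0e+05"
"42"
end
* Position of the first "e-" or "e+", whichever appears first (0 if neither)
generate pm  = strpos(s, "e-")
generate pp  = strpos(s, "e+")
generate pos = cond(pm == 0, pp, cond(pp == 0, pm, min(pm, pp)))
* Split the value into <mantissa>, <esign> and <exponent>
generate mantissa = cond(pos > 0, substr(s, 1, pos - 1), s)
generate esign    = cond(pos > 0, substr(s, pos, 2), "")
generate expo     = cond(pos > 0, substr(s, pos + 2, .), "")
* Without elzero: strip leading zeros if the exponent is an unsigned integer
replace expo = string(real(expo)) if regexm(expo, "^[0-9]+$")
* htmlsuper rule: <eminfix> "x10<sup>-", <epinfix> "x10<sup>", <esuffix> "</sup>"
generate out = mantissa
replace out = mantissa + "x10<sup>-" + expo + "</sup>" if esign == "e-"
replace out = mantissa + "x10<sup>"  + expo + "</sup>" if esign == "e+"
list s out
```

Values without an <esign>, such as "42" here, pass through unchanged, which matches the first alternative in the output syntax.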
Remarks
sdecode is a separate package from sencode ("super encode"), which is also downloadable from SSC. However, the two packages both have the alternative generate() and replace options. They are complementary to the destring command and the tostring command, which are part of official Stata. tostring and destring convert numeric values to and from their formatted string values, respectively, but they do not use value labels, and they do contain precautionary features to prevent the loss of information. sdecode and sencode, on the other hand, do use value labels, and allow the possibility that the mapping from numeric values to string values can be many-to-one.
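The difference from tostring can be seen on a labelled variable. This is a minimal sketch, assuming the auto data and an installed sdecode package; the output variable names are hypothetical.

```stata
* Assumes the sdecode package is installed (for example via: ssc install sdecode)
sysuse auto, clear
* Official tostring ignores value labels and stores the numeric values as text
tostring foreign, generate(foreign_ts)
* sdecode uses the value labels attached to foreign
sdecode foreign, generate(foreign_sd)
tabulate foreign_ts foreign_sd
```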
For more about the use of sdecode with listtab and other SSC packages to create tables, see Newson (2012).
Examples
. sdecode price, replace
. sdecode foreign, replace labonly
. sdecode foreign, gene(origin)
. sdecode foreign, gene(origin) maxlen(3)
. replace foreign=_n/_N if mod(_n,2)
. sdecode foreign, gene(origin1)
. sdecode foreign, gene(origin2) format(%8.4f)
. sdecode rep78, gene(srep78) missing
. sdecode price, gene(sprice) prefix($)
. sdecode weight, gene(sweight) suffix(" lb")
. sdecode weight, gene(esweight) format(%8.1e) esub(htmlsuper)
. msdecode foreign weight price, gene(fwp) delim(", " "lb for $")
. msdecode foreign weight price, gene(fwp) replace delim(" car weighing " " lb and costing ") suffix(" dollars")
Author
Roger Newson, Imperial College London, UK. Email: r.newson@imperial.ac.uk
References
Newson, R. B. 2012. From resultssets to resultstables in Stata. The Stata Journal 12(2): 191-213.
Also see
Manual: [D] compress, [D] destring, [D] encode, [D] format, [D] functions, [D] generate, [D] label
Help: [D] compress, [D] destring, [D] encode, [D] decode, [D] format, [D] functions, [D] generate, [D] label; sencode, listtab if installed