Decode numeric variable into string using formats for unlabelled values
sdecode varname [if] [in] , [ generate(newvar) | replace ] [ maxlength(#) format(format_spec) labonly missing ftrim xmlsub esub(esubstitution_rule [, elzero]) prefix(string) suffix(string) ]
where format_spec is either a format or a string variable name, and esubstitution_rule is any one of
none | x10 | rtfsuper | texsuper | htmlsuper | smclsuper
Description
sdecode ("super decode") creates an output string variable with values from the input numeric variable varname, using labels if present and formats otherwise. The output string variable may either replace the input numeric variable or be generated as a new variable named newvar. Unlike decode, sdecode creates an output string variable containing the values of the input variable as output by the tabulate command and other Stata output, instead of decoding all unlabelled input values to missing. sdecode is especially useful if a numeric variable has value labels for some values but not for others.
Options
Either generate() or replace must be specified, but both options may not be specified at the same time.
generate(newvar) specifies the name of a new output string variable to be created.
replace specifies that the output string variable will replace the input numeric variable, and have the same name, the same position in the data set, and the same variable label and characteristics if present.
maxlength(#) is optional. It specifies how many characters of the value label to retain. # must be an integer between 1 and the maximum string variable length, which is stored in the system parameter c(maxstrvarlen). If unset, then maxlength() is set to the maximum string variable length.
format(format_spec) is optional. It specifies the format (or formats) used for decoding unlabelled values of the input numeric variable. It may be either a format (to be used for all unlabelled values), or the name of a string format variable (in which case each observation with an unlabelled value is decoded using the format stored in the string format variable for that observation). If format() is not specified, then sdecode uses the format associated with the input numeric variable.
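As a minimal sketch of the string-format-variable form of format(), the lines below assume the auto data shipped with Stata and that sdecode is installed from SSC. The names fmtvar and sgear are hypothetical, chosen only for illustration.

```stata
* Sketch: decode gear_ratio with a different format in each group.
* fmtvar and sgear are hypothetical names, not part of sdecode.
sysuse auto, clear

* One format per observation, stored in a string variable
generate str8 fmtvar = cond(foreign == 1, "%8.1f", "%8.3f")

* format() receives the name of the string format variable
sdecode gear_ratio, generate(sgear) format(fmtvar)

* Inspect one domestic and one foreign car
list make foreign gear_ratio fmtvar sgear if inlist(_n, 1, 74)
```

Passing a single format instead, as in format(%8.2f), applies that format to every unlabelled value.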
labonly is optional. It specifies that only labelled values for the input numeric variable will be decoded to nonmissing string values in the output string variable, and that unlabelled values will be decoded to a missing string value, as with decode. If labonly is not specified, then all nonmissing values of the input numeric variable will be decoded to nonmissing string values, except for values in observations excluded by the if and in qualifiers, which are decoded to a missing string value.
missing is optional. It specifies that missing values in the input numeric variable will be decoded (using formats) to non-missing formatted string values (such as "."). If missing is absent, then missing values in the input numeric variable are decoded to missing string values.
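The effect of missing can be seen by decoding the same variable with and without it. This is a sketch assuming the auto data shipped with Stata, in which rep78 has some missing values. The output variable names rep_default and rep_missing are hypothetical.

```stata
* Sketch: compare decoding of missing values with and without missing.
* rep_default and rep_missing are hypothetical variable names.
sysuse auto, clear

* Without missing: numeric missing values become missing (empty) strings
sdecode rep78, generate(rep_default)

* With missing: numeric missing values are formatted, for example as "."
sdecode rep78, generate(rep_missing) missing

* Show only the observations where rep78 is missing
list make rep78 rep_default rep_missing if missing(rep78)
```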
ftrim is optional. It specifies that values of the output string variable produced using a format will be trimmed to remove spaces on the left and on the right.
xmlsub is optional. It specifies that, in the decoded string output variable, the substrings "&", "<" and ">" will be replaced throughout with the XML entity references "&amp;", "&lt;" and "&gt;", respectively. This is useful if the decoded string output variable is intended for output to a table in a document in XHTML, or in other XML-based languages. This substitution, if specified, is performed before any substitution specified by the esub() option.
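A small sketch of xmlsub on labelled values follows. The data set, the variable names grp and sgrp, and the value label grplab are all made up for illustration. The results noted in the comments are what the description above implies, not captured output.

```stata
* Sketch: XML entity substitution in decoded value labels.
* grp, sgrp and grplab are hypothetical names.
clear
input byte grp
1
2
end
label define grplab 1 "R&D" 2 "Age <18"
label values grp grplab

* Per the xmlsub description, "&" and "<" should be replaced by
* XML entity references in the output string variable
sdecode grp, generate(sgrp) xmlsub
list grp sgrp
```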
esub(esubstitution_rule [, elzero]) is optional. It specifies a rule for substitution of exponents in decoded values produced using the format specified by the format() option, to make them more suitable for output to TeX, HTML, RTF, or other word processor documents. The presence of exponents is normally indicated, in Stata formatted values, by the presence of substrings "e-" or "e+". These substrings may indicate that the substring to the left is a mantissa, and that the substring to the right is the absolute value of an exponent, conventionally presented in documents as a superscript. The possible values of the esubstitution_rule are none, x10, rtfsuper, texsuper, htmlsuper and smclsuper. These rules are documented below under Substitution rules for the esub() option. The suboption elzero, if present, indicates that, if the exponent contains leading zeros, then those leading zeros will be retained in the final formatted value. If the esub() option is specified without the elzero suboption, then such leading zeros are removed.
prefix(string) is optional. It specifies a prefix string, to be added to the left of the decoded string variable.
suffix(string) is optional. It specifies a suffix string, to be added to the right of the decoded string variable.
Substitution rules for the esub() option
If the user specifies an esub() option, then sdecode performs exponent substitution on those values of the output string variable which were produced using the format specified by the format() option. This is done after any trimming specified by the ftrim option and/or any XML entity substitution specified by the xmlsub option, and before any addition of prefixes and suffixes specified by the prefix() and suffix() options.
The first step is to locate the first appearance, in the output string value, of the substring "e-" or the substring "e+", whichever appears first. This substring (if it exists) is known as the <esign>. The substring to the left is known as the <mantissa>, and the substring to the right is known as the <exponent>. An output string value therefore has the syntax
<mantissa> | <mantissa><esign><exponent>
where <mantissa> is a string without any embedded "e-" or "e+" substrings. If elzero is not specified, then the next step is to attempt to remove any leading zeros from the <exponent>, using a method that works if the <exponent> is an unsigned integer.
If an <esign> is present, then the next step is to replace the <esign> with an infix string <eminfix> if the <esign> is "e-", or with an infix string <epinfix> if the <esign> is "e+", and to append a string <esuffix> to the end of the <exponent>. The esubstitution_rule is defined by the values of the <eminfix>, <epinfix> and <esuffix> strings. The revised output string should then have the syntax
<mantissa> | <mantissa><eminfix><exponent><esuffix> | <mantissa><epinfix><exponent><esuffix>
The values for the different esubstitution_rules are as follows:
-------------------------------------------------------------------------------
esubstitution_rule   <eminfix>         <epinfix>        <esuffix>   Description
none                 "e-"              "e+"             ""          No substitution
x10                  "x10-"            "x10"            ""          To be superscripted manually
rtfsuper             "x10{\super -"    "x10{\super "    "}"         RTF superscript
texsuper             "\times 10^{-"    "\times 10^{"    "}"         TeX superscript
htmlsuper            "x10<sup>-"       "x10<sup>"       "</sup>"    HTML superscript
smclsuper            "x10{sup:-"       "x10{sup:"       "}"         SMCL superscript
-------------------------------------------------------------------------------
Note that, if the user specifies esub(none,elzero), then the result is equivalent to specifying no esub() option. SMCL superscripts are documented in the online help for Stata graphics text.
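The substitution steps described above can be traced by hand with ordinary Stata string functions. The sketch below applies the texsuper rule to three made-up formatted values. It is an illustration of the documented steps, not sdecode's own source code, and all variable names are hypothetical.

```stata
* Sketch: the texsuper rule applied by hand (not sdecode's source code).
clear
input str12 s
"1.5e-03"
"2.0e+10"
"42"
end

* Locate the first "e-" or "e+", whichever appears first (0 if neither)
generate int pm = strpos(s, "e-")
generate int pp = strpos(s, "e+")
generate int pos = cond(pm == 0, pp, cond(pp == 0, pm, min(pm, pp)))

* Split into <mantissa>, <esign> and <exponent>
generate str12 mantissa = cond(pos > 0, substr(s, 1, pos - 1), s)
generate str2 esign = cond(pos > 0, substr(s, pos, 2), "")
generate str12 exponent = cond(pos > 0, substr(s, pos + 2, .), "")

* Remove leading zeros (the behaviour without elzero), which works
* when the exponent is an unsigned integer
replace exponent = string(real(exponent)) if pos > 0 & real(exponent) < .

* Insert <eminfix> or <epinfix> and append <esuffix> for texsuper
generate str40 out = s
replace out = mantissa + cond(esign == "e-", "\times 10^{-", "\times 10^{") ///
    + exponent + "}" if pos > 0

* Expected: 1.5\times 10^{-3}, 2.0\times 10^{10}, and 42 unchanged
list s out
```

In practice the same result would be requested directly with a command of the form sdecode x, generate(sx) format(%8.1e) esub(texsuper), where x and sx stand for the user's own variable names.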
Remarks
sdecode is a separate package from sencode ("super encode"), which is also downloadable from SSC. However, the two packages both have the alternative generate() and replace options. They are complementary to the destring command and the tostring command, which are part of official Stata. tostring and destring convert numeric values to and from their formatted string values, respectively, but they do not use value labels, and they do contain precautionary features to prevent the loss of information. sdecode and sencode, on the other hand, do use value labels, and allow the possibility that the mapping from numeric values to string values can be many-to-one.
Examples
. sdecode price, replace
. sdecode foreign, replace labonly
. sdecode foreign, gene(origin)
. sdecode foreign, gene(origin) maxlen(3)
. replace foreign=_n/_N if mod(_n,2)
. sdecode foreign, gene(origin1)
. sdecode foreign, gene(origin2) format(%8.4f)
. sdecode rep78, gene(srep78) missing
. sdecode price, gene(sprice) prefix($)
. sdecode weight, gene(sweight) suffix(" lb")
. sdecode weight, gene(esweight) format(%8.1e) esub(htmlsuper)
Author
Roger Newson, Imperial College London, UK. Email: r.newson@imperial.ac.uk
Also see
Manual: [D] compress, [D] destring, [D] encode, [D] format, [D] functions, [D] generate, [D] label
Help: [D] compress, [D] destring, [D] encode, [D] decode, [D] format, [D] functions, [D] generate, [D] label; sencode if installed