Encode string into numeric in a sequential or other non-alphanumeric order
sencode varname [if] [in] , [ generate(newvar) | replace ] [ label([name][,[replace]]) gsort(gsort_list) manyto1 fast ]
where gsort_list is a list of one or more elements of the form
    [+|-]varname
as used by the gsort command.
sencode ("super encode") creates an output numeric variable, with labels from the input string variable varname. The output numeric variable may either replace the input string variable or be generated as a new variable named newvar. The labels are specified by creating, adding to, or just using (as necessary) the value label newvar, or, if specified, the value label name. Unlike encode, sencode can order the numeric values corresponding to the string values in a logical order, instead of ordering them in alphanumeric order of the string value, as encode does. This logical order defaults to the order of appearance of the string values in the dataset, but may be an alternative order specified by the user in the gsort option. The mapping of numeric code values to string values may be one-to-one, so that each string value has a single numeric code, or many-to-one, so that each string value may have multiple numeric codes, corresponding to multiple appearances of the string value in the dataset. sencode may be useful when the input string variable is used as a source of axis labels in a Stata graph and the output numeric variable is used as the X-variable or Y-variable.
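The difference from encode can be seen on a very small dataset. The following is a minimal sketch, assuming that sencode is installed; the variable names size, size_alpha and size_seq are invented for illustration.

```stata
* Assumes sencode is installed (for example via: ssc install sencode).
* Hypothetical 4-observation dataset with a string variable.
clear
input str6 size
"small"
"medium"
"large"
"medium"
end

* encode allocates codes in alphanumeric order of the string values:
* large = 1, medium = 2, small = 3
encode size, generate(size_alpha)

* sencode allocates codes in order of first appearance by default:
* small = 1, medium = 2, large = 3
sencode size, generate(size_seq)

label list size_alpha size_seq
list size size_alpha size_seq, nolabel
```

If size_seq is then used as the X-variable of a graph, the axis labels should appear in the order small, medium, large rather than in alphanumeric order.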
Either generate() or replace must be specified, but both options may not be specified at the same time.
generate(newvar) specifies the name of a new output numeric variable to be created.
replace specifies that the output numeric variable will replace the input string variable, and have the same name, the same position in the dataset, and the same variable label and characteristics if present.
label([name][,[replace]]) is optional. It specifies the name of the value label to be created, or, if the named value label already exists, used and added to as necessary. If label() is not specified, or is specified without the name, then sencode uses the same name for the label as it uses for the new variable, as specified by the generate() or replace option. If the replace suboption of the label() option is specified, then any existing value label with the indicated value label name is dropped before the new values are added, and the added values range from 1 to the number of distinct new labels.
gsort(gsort_list) is optional. It specifies a generalized sorting order for the allocation of code numbers to the non-missing values of the input string variable. If the gsort() option is not specified, then it is set to the sequential order of the observation in the dataset. The gsort_list is interpreted in the way used by the gsort command. Observations are grouped in ascending or descending order of the specified varnames. Each varname in the gsort() option can be numeric or string. Observations are grouped in ascending order of varname if + or nothing is typed in front of the name, and in descending order of varname if - is typed in front of the name. If there are multiple non-missing values of the input string variable in a group specified by the gsort() option, then the group is split into subgroups, one subgroup for each non-missing input string value, and these subgroups are ordered alphanumerically within the group by the input string values. If there are multiple groups with the same input string value, and the manyto1 option is not specified, then multiple groups with the same input string value are combined into the first group with that input string value. The ordered groups are then allocated integer code values, and these values are stored in the output variable specified by the generate() or replace option. Note that the dataset remains sorted in its original order, even if the user specifies the gsort() option.
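As an illustration of the gsort() option, the sketch below allocates codes in descending order of a numeric variable. It assumes that sencode is installed; the variables name, score and name_rank are hypothetical.

```stata
* Assumes sencode is installed (for example via: ssc install sencode).
* Hypothetical dataset: one string identifier and one numeric score.
clear
input str5 name score
"ann" 10
"bob" 30
"cat" 20
end

* Allocate codes in descending order of score:
* bob = 1, cat = 2, ann = 3
sencode name, generate(name_rank) gsort(-score)

* The observations are still in their original order (ann, bob, cat).
list name score name_rank, nolabel
```

Only the allocation of codes follows the gsort() order; the listing should show the observations in the order in which they were entered.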
manyto1 is optional. It specifies that the mapping from the numeric codes to the possible values of the input string variable varname may be many-to-one, so that each string value may have multiple numeric codes, corresponding to multiple positions of that string value in the dataset. These multiple positions may correspond to multiple observations (if gsort() is not specified), or to multiple groups of observations specified by gsort(). If manyto1 is not specified, then each string value will have one numeric code, and these numeric codes are usually ordered by the position of the first appearance of the string value in the dataset.
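The effect of manyto1 can be sketched by encoding the same string variable twice. This assumes that sencode is installed; the variables arm, arm_one and arm_many are hypothetical names.

```stata
* Assumes sencode is installed (for example via: ssc install sencode).
* Hypothetical dataset in which the value "A" appears twice.
clear
input str1 arm
"A"
"B"
"A"
end

* One-to-one (default): each string value has a single code.
* Codes: 1, 2, 1
sencode arm, generate(arm_one)

* Many-to-one: each observation has its own code, so "A" has two codes.
* Codes: 1, 2, 3, with labels 1 "A", 2 "B", 3 "A"
sencode arm, generate(arm_many) manyto1

label list arm_one arm_many
list arm arm_one arm_many, nolabel
```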
fast is a programmers' option. It specifies that no action will be taken to restore the original data in memory if the user presses the Break key.
sencode encodes the string values in the string input variable as follows. First, sencode selects observations in the dataset with non-missing values of the input string variable. If if and/or in is specified, then sencode selects the subset of those observations selected by if and/or in. These observations are then grouped into ordered groups, ordered primarily by the gsort() option and secondarily by the value of the input string variable. (If the gsort() option is not specified, then it is set to a single temporary variable, with values set to the expression _n, equal in each observation to the sequential order of that observation in the dataset, so there is only one observation per gsort() group.) Next, if manyto1 is not specified, any set of multiple ordered groups with the same value of the input string variable is combined into the first group in that set. Each of the ordered groups existing at this stage is then allocated an integer code value. These integer code values are usually ordered primarily by the gsort() option, and secondarily by the alphanumeric order of the input string variable. The code values are then stored in the new variable specified by the generate() or replace option.
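For the default case (no gsort() and no manyto1), the allocation of codes can be imitated by hand with official Stata commands. The sketch below is not the code used by sencode, and it does not create a value label; the variable names obsno, firstobs and code are invented for illustration.

```stata
* Hand-made imitation of the default code allocation, for illustration only.
clear
input str6 size
"small"
"medium"
"large"
"medium"
end

* Record the sequential order of each observation.
generate long obsno = _n

* For each string value, find the observation where it first appears.
bysort size (obsno): generate long firstobs = obsno[1]

* Combine groups with the same string value and number them 1, 2, 3, ...
* in order of first appearance, skipping missing string values.
egen long code = group(firstobs) if !missing(size)

* Restore the original sort order of the dataset.
sort obsno
list size code
```

Here, small, medium and large receive the codes 1, 2 and 3 respectively, and the second occurrence of medium shares the code 2.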
Usually, the code values range from one to the final number of groups. Exceptions arise when the value label implied by the label() option of sencode is a pre-existing value label, with pre-existing associations between numeric code values and string labels already defined, and the replace suboption of the label() option is not specified. In this case, sencode does not modify existing associations. The consequences of this policy depend on whether or not manyto1 is specified. If manyto1 is not specified, and the input string value of an ordered group has a pre-existing numeric code, then that pre-existing numeric code continues to be used for that group, and new numeric codes are generated for any input string values without existing numeric codes. If manyto1 is specified, and there are existing associations, then a new numeric code is generated for each ordered group, whether or not the input string value for that ordered group has a pre-existing numeric code. In both cases, newly-generated numeric codes are ordered by group, and are chosen to be consecutive integers, starting either from 1 or from the smallest integer greater than any pre-existing positive non-missing numeric code, whichever is greater.
These features of sencode may cause problems: for example, the code values may not run from one to the number of groups, or may not follow the intended order. Fortunately, these problems can be avoided if a value label name is specified (by the label() option) to be different from any pre-existing value label name, or if the replace suboption of the label() option is used.
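The following sketch illustrates the rules described above for a pre-existing value label, and one way of avoiding their consequences. It assumes that sencode is installed; the label names sizelab and sizelab2 and the variable names are hypothetical, and the codes in the comments are those implied by the rules above.

```stata
* Assumes sencode is installed (for example via: ssc install sencode).
clear
input str6 size
"small"
"medium"
"large"
end

* A pre-existing value label that already maps 2 to "small".
label define sizelab 2 "small"

* Existing associations are kept, so small stays at 2, and the new codes
* start above the largest existing code: medium = 3, large = 4.
sencode size, generate(size_keep) label(sizelab)

* Using a value label name that does not yet exist avoids this:
* small = 1, medium = 2, large = 3.
sencode size, generate(size_fresh) label(sizelab2)

label list sizelab sizelab2
list size size_keep size_fresh, nolabel
```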
sencode is a separate package from sdecode ("super decode"), which is also downloadable from SSC. However, the two packages both have the alternative generate and replace options. They are complementary to the destring command (which is part of official Stata) and the tostring command (which became part of official Stata in Version 8.1). tostring and destring convert numeric values to and from their formatted string values, respectively, but they do not use value labels, and they do contain precautionary features to prevent the loss of information. sdecode and sencode, on the other hand, do use value labels, and are based on the principle that the mapping from numeric values to string values can be many-to-one.
If we type this example in the auto data, then all US-made cars will be ordered before all cars from the rest of the world, and each car type (US and non-US) will be ordered alphanumerically. If we used encode instead of sencode, then cars would be ordered alphanumerically by make (so Audi cars would appear before Ford cars).
. sort foreign make
. sencode make, replace
. tab make
If we type this in the auto data, then a new variable origseq will be created, with a value for each observation equal to the sequential order of that observation in the dataset, and a value label for each value i equal to the car origin type (Domestic or Foreign) for the ith car.
. decode foreign, gene(orig)
. sencode orig, gene(origseq) many
. lab list origseq
. tab origseq
. list foreign origseq, nolab
If we type this in the auto data, then we will encode make so that all non-US cars have lower codes than all US cars (so Volvo cars have lower codes than AMC cars), but the data remain sorted as before:
. sencode make, gene(makeseq) gsort(-foreign)
. tab makeseq, m
. lab list makeseq
. list make makeseq, nolab
Roger Newson, Imperial College London, UK. Email: firstname.lastname@example.org
This program has benefitted from advice from Nicholas J. Cox, of the University of Durham, U.K., and from Patrick Joly of Industry Canada, Ontario, Canada.
Manual: [D] compress, [D] destring, [D] encode, [D] generate, [D] gsort, [D] label
Online: help for compress, decode, destring, encode, generate, gsort, label, tostring
        help for sdecode if installed