{smcl} {hline} help for {cmd:sencode} {right:(Roger Newson)} {hline} {title:Encode string into numeric in a sequential or other non-alphanumeric order} {p 8 21 2} {cmd:sencode} {varname} {ifin} , [ {cmdab:g:enerate}{cmd:(}{newvar}{cmd:)} | {cmd:replace} ] [ {cmdab:l:abel}{cmd:(}{it:name}{cmd:)} {cmdab:gs:ort}{cmd:(}{it:gsort_list}{cmd:)} {cmdab:man:yto1} ] {pstd} where {it:gsort_list} is a list of one or more elements of the form {p 8 21 2} [{cmd:+}|{cmd:-}]{varname} {pstd} as used by the {helpb gsort} command. {title:Description} {pstd} {cmd:sencode} ("super {helpb encode}") creates an output numeric variable, with {help label:labels} from the input string variable {varname}. The output numeric variable may either replace the input string variable or be generated as a new variable named {newvar}. The labels are specified by creating, adding to, or just using (as necessary) the value label {newvar}, or, if specified, the value label {it:name}. Unlike {helpb encode}, {cmd:sencode} can order the numeric values corresponding to the string values in a logical order, instead of ordering them in alphanumeric order of the string value, as {helpb encode} does. This logical order defaults to the order of appearance of the string values in the data set, but may be an alternative order specified by the user in the {cmd:gsort} option. The mapping of numeric code values to string values may be one-to-one, so that each string value has a single numeric code, or many-to-one, so that each string value may have multiple numeric codes, corresponding to multiple appearances of the string value in the data set. {cmd:sencode} may be useful when the input string variable is used as a source of axis labels in a {help graph:Stata graph} and the output numeric variable is used as the {it:X}-variable or {it:Y}-variable. {title:Options} {p 4 8 2} Either {cmd:generate()} or {cmd:replace} must be specified, but both options may not be specified at the same time. {p 4 8 2}{cmd:generate(}{newvar}{cmd:)} specifies the name of a new output numeric variable to be created. {p 4 8 2}{cmd:replace} specifies that the output numeric variable will replace the input string variable, and have the same name, the same {help order:position in the data set}, and the same {help label:variable label} and {help char:characteristics} if present. {p 4 8 2}{cmd:label(}{it:name}{cmd:)} is optional. It specifies the name of the {help label:value label} to be created or, if the named value label already exists, used and added to as necessary. If {cmd:label} is not specified, then {cmd:sencode} uses the same name for the label as it uses for the new variable, as specified by the {cmd:generate()} or {cmd:replace} option. {p 4 8 2}{cmd:gsort(}{it:gsort_list}{cmd:)} is optional. It specifies a generalized sorting order for the allocation of code numbers to the non-missing values of the input string variable. If the {cmd:gsort()} option is not specified, then it is set to the sequential order of the observation in the data set. The {it:gsort_list} is interpreted in the way used by the {helpb gsort} command. Observations are grouped in ascending or descending order of the specified {varname}s. Each {varname} in the {cmd:gsort()} option can be numeric or string. Observations are grouped in ascending order of {varname} if {cmd:+} or nothing is typed in front of the name, and in descending order of {varname} if {cmd:-} is typed in front of the name. If there are multiple non-missing values of the input string variable in a group specified by the {cmd:gsort()} option, then the group is split into subgroups, one subgroup for each non-missing input string value, and these subgroups are ordered alphanumerically within the group by the input string values. If there are multiple groups with the same input string value, and the {cmd:manyto1} option is not specified, then multiple groups with the same input string value are combined into the first group with that input string value. The ordered groups are then allocated integer code values, and these values are stored in the output variable specified by the {cmd:generate()} or {cmd:replace} option. Note that the data set remains sorted in its original order, even if the user specifies the {cmd:gsort()} option. {p 4 8 2}{cmd:manyto1} is optional. It specifies that the mapping from the numeric codes to the possible values of the input string variable {varname} may be many-to-one, so that each string value may have multiple numeric codes, corresponding to multiple positions of that string value in the data set. These multiple positions may correspond to multiple observations (if {cmd:gsort()} is not specified), or to multiple groups of observations specified by {cmd:gsort()}. If {cmd:manyto1} is not specified, then each string value will have one numeric code, and these numeric codes are usually ordered by the position of the first appearance of the string value in the data set. {title:Technical note} {pstd} {cmd:sencode} encodes the string values in the string input variable as follows. First, {cmd:sencode} selects observations in the data set with non-missing values of the input string variable. If {cmd:if} and/or {cmd:in} is specified, then {cmd:sencode} selects the subset of those observations selected by {cmd:if} and/or {cmd:in}. Then, these observations are grouped into ordered groups, ordered primarily by the {cmd:gsort()} option and secondarily by the value of the input string variable. (If the {cmd:gsort()} option is not specified, then it is set to a single temporary variable, with values set to the expression {cmd:_n}, equal in each observation to the sequential order of that observation in the data set, and there is therefore only one observation per {cmd:gsort()} group.) Then, if {cmd:manyto1} is not specified, any set of multiple ordered groups with the same value of the input string variable is combined into the first group in that set of groups. Each of the ordered groups existing at this stage is then allocated an integer code value. These integer code values are usually ordered primarily by the {cmd:gsort()} option and secondarily by the alphanumeric order of the input string variable. The code values are then stored in the new variable specified by the {cmd:generate()} or {cmd:replace} option. {pstd} Usually, the code values range from one to the final number of groups. Exceptions arise when the {help label:value label} implied by the {cmd:label()} option of {cmd:sencode} is a pre-existing {help label:value label}, with pre-existing associations between numeric code values and string labels already defined. In this case, {cmd:sencode} does not modify existing associations. The consequences of this policy depend on whether or not {cmd:manyto1} is specified. If {cmd:manyto1} is not specified, and the input string value of an ordered group has a pre-existing numeric code, then that pre-existing numeric code continues to be used for that group, and new numeric codes are generated for any input string values without existing numeric codes. If {cmd:manyto1} is specified, and there are existing associations, then a new numeric code is generated for each ordered group, whether or not the input string value for that ordered group has a pre-existing numeric code. In both cases, newly-generated numeric codes are ordered by group, and are chosen to be greater than the greatest pre-existing numeric code. {pstd} These features of {cmd:sencode} may cause problems. Fortunately, these problems can be avoided if a value label name is specified (by the {cmd:label()} option) to be different from any pre-existing value label name, or if existing value labels are removed using the {helpb label:label drop} or {helpb clear} commands. {title:Remarks} {pstd} {cmd:sencode} is a separate package from {helpb sdecode} ("super {helpb decode}"), which is also downloadable from SSC, However, the two packages both have the alternative {cmd:generate} and {cmd:replace} options. They are complementary to the {helpb destring} command (which is part of official Stata) and the {helpb tostring} command (which became part of official Stata in Version 8.1). {helpb tostring} and {helpb destring} convert numeric values to and from their formatted string values, respectively, but they do not use {help label:value labels}, and they do contain precautionary features to prevent the loss of information. {helpb sdecode} and {cmd:sencode}, on the other hand, do use {help label:value labels}, and are based on the principle that the mapping from numeric values to string values can be many-to-one. {title:Examples} {pstd} If we type this example in the {hi:auto} data, then all US-made cars will be ordered before all cars from the rest of the world, and each car type (US and non-US) will be ordered alphanumerically. If we used {helpb encode} instead of {cmd:sencode}, then cars would be ordered alphanumerically by make (so Audi cars would appear before Ford cars). {p 8 12 2}{cmd:. sort foreign make}{p_end} {p 8 12 2}{cmd:. sencode make, replace}{p_end} {p 8 12 2}{cmd:. tab make}{p_end} {pstd} If we type this in the {hi:auto} data, then a new variable {hi:origseq} will created, with a value for each observation equal to the sequential order of that observation in the data set, and a value label for each value {hi:i} equal to the car origin type ({hi:Domestic} or {hi:Foreign}) for the {hi:i}th car. {p 8 12 2}{cmd:. decode foreign, gene(orig)}{p_end} {p 8 12 2}{cmd:. sencode orig, gene(origseq) many}{p_end} {p 8 12 2}{cmd:. lab list origseq}{p_end} {p 8 12 2}{cmd:. tab origseq}{p_end} {p 8 12 2}{cmd:. list foreign origseq, nolab}{p_end} {pstd} If we type this in the {hi:auto} data, then we will encode {hi:make} so that all non-US cars have lower codes than all US cars (so Volvo cars have lower codes than AMC cars), but the data remain sorted as before: {p 8 12 2}{cmd:. sencode make, gene(makeseq) gsort(-foreign)}{p_end} {p 8 12 2}{cmd:. tab makeseq, m}{p_end} {p 8 12 2}{cmd:. lab list makeseq}{p_end} {p 8 12 2}{cmd:. list make makeseq, nolab}{p_end} {title:Author} {pstd} Roger Newson, Imperial College London, UK. Email: {browse "mailto:r.newson@imperial.ac.uk":r.newson@imperial.ac.uk} {title:Acknowledgement} {pstd} This program has benefitted from advice from Nicholas J. Cox, of the University of Durham, U.K., and from Patrick Joly of Industry Canada, Ontario, Canada. {title:Also see} {p 4 13 2} {bind: }Manual: {hi:[D] compress}, {hi:[D] destring}, {hi:[D] encode}, {hi:[D] generate}, {hi:[D] gsort}, {hi:[D]label} {p_end} {p 4 13 2} Online: help for {helpb compress}, {helpb decode}, {helpb destring}, {helpb encode}, {helpb generate}, {helpb gsort}, {helpb label}, {helpb tostring} {break} help for {helpb sdecode} if installed {p_end}