Title
todummy -- Create dummy variables
Syntax
todummy varlist [if] [in] , { values(vlist)|keyword } [options]
where vlist is a list of values specified in one or more numlists of the form
[=]numlist [\ [=]numlist ...]
options                 Description
-------------------------------------------------------------------------
* values(vlist)         specify values to be coded 1
  percentile            interpret vlist as list of percentiles
  cut                   interpret vlist as cutpoints

* Keywords
  levels                create one dummy for each level of the original
                          variable
  median                assign value 1 if the original variable is greater
                          than or equal to the 50th percentile
  q                     create one dummy for each quartile of the original
                          variable

Names
  generate(namelist)    create dummies name1, name2, ...
  prefix(pre)           use pre as prefix for created dummies
  suffix(suff)          use suff as suffix for created dummies
  stub(stub)            use stub1, stub2, ... as dummies' names
  replace               replace existing variables with dummies
  nonames               do not use value labels as variable names (levels)

Labels
  [r]label(lbllist)     use label1, label2, ... as variable labels
  novarlabel            do not assign variable labels

Missing values
  missing               create dummy for missings (levels) or copy missing
                          values

Advanced
  noskip[(drop)]        do not skip creation of existing dummies
  ro(rel. operator)     specify relational operator
  noexclude             use all observations to create dummies, even if
                          excluded by if and/or in qualifiers
-------------------------------------------------------------------------
* one of values() or keyword must be specified
Description
todummy creates indicator variables (also called dummies) from the variables in varlist. Either one or multiple dummies may be created from each variable. If one dummy per variable is created, default names are d_varname.
Options
values(vlist) assigns value 1 if the original variable equals the values specified in vlist, 0 otherwise. There will be as many dummies per variable as there are numlists in vlist. The first created dummy will be coded 1 if the original variable equals the values in the first numlist, the second dummy will be 1 if the original variable equals the values in the second numlist and so on. If more than one dummy is created the default names are varnameJ, where J indicates the number of the dummy created from the original variable. The dummies will not have variable labels. Non-integer values and missing values (i.e. ., .a, .b, ..., .z) are allowed in numlists. If numlist has missing values, the created dummy will not have missing values.
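As a rough illustration of what values() amounts to, the coding can be sketched with built-in Stata commands only. This is an assumption about equivalent behavior, not todummy's actual code, and the variable names managers and mgr_or_sales are made up for the example.

```stata
sysuse nlsw88, clear

* one numlist holding one value: 1 if occupation equals 2, 0 otherwise;
* missing occupation stays system missing (.)
generate byte managers = (occupation == 2) if !missing(occupation)

* one numlist holding several values: 1 if occupation equals 2 or 3
generate byte mgr_or_sales = inlist(occupation, 2, 3) if !missing(occupation)
```

Each numlist in vlist corresponds to one such generate statement, which is why the number of dummies equals the number of numlists.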
percentile interprets vlist as a list of percentiles (which must be between 0 and 100). If a numlist contains only one percentile, the created dummy variable will be coded 1 if the original variable is greater than or equal to this percentile. Specifying k percentiles, where k > 1, will result in k + 1 dummies being created. The first dummy will be coded 1 if the original variable is lower than or equal to the first specified percentile, the second dummy will be coded 1 if the original variable takes on values between the first and the second percentile, and so on. An equal sign (=) in front of a numlist causes the first and last dummy not to be created. Thus, specifying k percentiles will result in k - 1 dummies. If more than one dummy per variable is created, default names are varnameJ, where J indicates the number of the dummy created from the original variable. The dummies' variable labels are varname (P), where P indicates the values of the percentiles the dummy represents.
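The single-percentile case can be sketched with the built-in _pctile command. This is a minimal illustration assuming the default relational operator (>=); it is not taken from todummy's source, and wage_hi is a hypothetical variable name.

```stata
sysuse nlsw88, clear

* compute the 50th percentile of wage; the result is left in r(r1)
_pctile wage, percentiles(50)

* 1 if wage is at or above that percentile, 0 otherwise;
* missing wages stay system missing (.)
generate byte wage_hi = (wage >= r(r1)) if !missing(wage)
```

The comparison uses r(r1) directly so that no precision is lost by copying the percentile into a macro first.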
cut interprets vlist as cutpoints. If a numlist contains only one value, the created dummy variable will be coded 1 if the original variable is greater than or equal to this value. Specifying k values, where k > 1, will result in k + 1 dummies being created. The first dummy will be coded 1 if the original variable is lower than or equal to the first specified value, the second dummy will be coded 1 if the original variable falls into the range between the first and the second value, and so on. An equal sign (=) in front of a numlist causes the first and last dummy not to be created. Thus, specifying k values will result in k - 1 dummies. If more than one dummy per variable is created, default names are varnameJ, where J indicates the number of the dummy created from the original variable. The dummies' variable labels are varname (R), where R indicates the range of values the dummy represents. Values may contain missings (i.e. ., .a, .b, ..., .z) and non-integers. If numlist has missing values, the created dummies will not have missing values.
levels creates one dummy for each level of the original variable. This is similar to what tabulate does (note however, that only numerical variables are allowed with todummy). Extended missing values (.a, .b, ..., .z) are copied from the original variable. Value labels from the original variable are used as variable names for the created dummies. If there are no value labels, default names are varnameJ, where J indicates the number of the dummy created from the original variable. The dummies' variable labels are varname (L), where L is the level.
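For comparison, the built-in tabulate approach mentioned above looks like this. The stub race_ is an arbitrary choice for the example; tabulate names the new variables by appending 1, 2, ... to the stub rather than using value labels.

```stata
sysuse nlsw88, clear

* creates race_1, race_2 and race_3, one indicator per level of race
tabulate race, generate(race_)
```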
median assigns value 1 if the original variable is greater than or equal to its median. The created dummies will not have variable labels.
q creates one dummy for each quartile of the original variable. Thus, four dummies will be created from each variable. The first dummy will be coded 1 if the original variable is lower than or equal to its 25th percentile, the second dummy will be 1 if the original variable takes on values between the 25th and 50th percentile, and so on. The dummies' variable labels are varname (P), where P indicates the values of the percentile the dummy represents.
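A comparable set of quartile indicators can be sketched with the built-in xtile command. This is only an approximation under stated assumptions: xtile's treatment of observations lying exactly on a quartile boundary may differ from todummy's, and the names wage_cat and wage_q1 to wage_q4 are hypothetical.

```stata
sysuse nlsw88, clear

* quartile category (1 to 4) of wage
xtile wage_cat = wage, nquantiles(4)

* one indicator per quartile: wage_q1, wage_q2, wage_q3, wage_q4
forvalues j = 1/4 {
    generate byte wage_q`j' = (wage_cat == `j') if !missing(wage_cat)
}
```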
generate(namelist) creates dummies name1, name2, ... . The number of names specified must equal the number of dummies to be created.
prefix(pre) uses pre as prefix for created dummies. If generate and suffix are not specified and one dummy per variable is to be created, the default prefix is d_. Option prefix may be combined with generate, suffix and stub.
suffix(suff) uses suff as suffix for created dummies. The option may be combined with generate, prefix and stub.
stub(stub) uses stubJ as dummies' names. Here J is the number of the created dummy per variable. The number of stubs specified must equal the number of variables in varlist. The option may be combined with prefix and suffix.
replace replaces existing variables in varlist with dummies. May not be specified with generate, prefix, suffix or stub. If more than one dummy per variable is created, replace is not allowed.
nonames does not use value labels as dummies' names. If specified, dummies' names are varnameJ, where J indicates the number of the dummy created from the original variable. Value labels will be used as variable labels for the created dummies. Only allowed with levels.
[r]label(lbllist) specifies variable labels for the created dummies. If more dummies are created than labels are specified, the dummies will not be labeled. Specifying rlabel allows re-using the labels for each original variable, meaning that dummies created from varname1 will have the same labels as dummies created from varname2. Specify "lbl" if lbl contains embedded spaces.
novarlabel does not use variable labels for the dummies. May not be specified with [r]label.
missing creates a dummy for missing values in the original variable if specified with levels. If specified with values, median, or q it causes missing values (., .a, .b, ..., .z) to be copied from the original variable. These values will by default be coded as system missings (.) if numlist has no missing values. If numlist has missing values, there will not be missing values in the created dummies, unless missing is specified.
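As background on why missing values need explicit handling, recall that Stata treats missing values as larger than any number in comparisons. The sketch below shows this in plain Stata and makes no claim about todummy's internals; the cutoff of 5 years and the variable names long_naive and long_tenure are chosen for illustration only.

```stata
sysuse nlsw88, clear

* naive coding: missing tenure counts as >= 5 and is therefore coded 1
generate byte long_naive = (tenure >= 5)

* missing values in tenure are carried over as system missing (.)
generate byte long_tenure = (tenure >= 5) if !missing(tenure)

* compare the two codings, including the missing category
tabulate long_naive long_tenure, missing
```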
noskip[(drop)] specifies how to handle existing dummies. In some cases todummy checks the existence of dummy names 'on the fly', meaning not until the dummies are created. If a dummy's name already exists in the dataset, default is to skip the creation of this dummy. This is not considered an error. Therefore a message is displayed but the program will not terminate. Specifying noskip will create a dummy in these cases, choosing a valid variable name. If noskip(drop) is specified, the existing variable will be dropped before creating the dummy. Note that this option differs from replace, which allows variables specified in varlist to be replaced with dummies.
ro(rel. operator) specifies the relational operator used with percentile or cut. Default is >=, meaning value 1 is assigned if the original variable is greater than or equal to the specified value. Specifying ro has no effect if more than one dummy per variable is created.
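The effect of the relational operator on a single cutpoint can be sketched in plain Stata as follows. This is an illustration of the comparison only, not todummy's code, and the names hrs_ge40 and hrs_lt40 are hypothetical.

```stata
sysuse nlsw88, clear

* default operator (>=): 1 if hours is 40 or more
generate byte hrs_ge40 = (hours >= 40) if !missing(hours)

* with ro(<): 1 if hours is below 40
generate byte hrs_lt40 = (hours < 40) if !missing(hours)
```

For non-missing hours the two indicators are complements of each other, since every observation is either below 40 or not.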
noexclude specifies that observations excluded by the if and/or in qualifiers are to be used to calculate the percentile or get the levels of the original variable. Only allowed with percentile or levels.
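The difference noexclude makes can be sketched with _pctile. The example assumes a subsample of married women and leaves excluded observations missing; how todummy itself codes excluded observations is not stated here, and the names hi_sub and hi_all are hypothetical.

```stata
sysuse nlsw88, clear

* median computed from the selected observations only (the default)
_pctile wage if married == 1, percentiles(50)
generate byte hi_sub = (wage >= r(r1)) if married == 1 & !missing(wage)

* median computed from all observations (as with noexclude),
* dummy still created for the selected observations only
_pctile wage, percentiles(50)
generate byte hi_all = (wage >= r(r1)) if married == 1 & !missing(wage)
```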
Examples
. sysuse nlsw88 ,clear
Create a dummy variable indicating observations with wages at or above the median wage.
. todummy wage ,values(50) percentile
Do the same using a keyword instead of values and percentile
. todummy wage ,median
Create three dummy variables, the first indicating persons aged 45 or older, the second indicating persons aged 40 or older, and the third indicating persons between ages 38 and 40.
. todummy age ,values(45 \ 40 \ = 38 40) cut
Create a dummy indicating persons working less than 40 hours.
. todummy hours ,values(40) cut ro(<) generate(workhrs)
Create 3 x 4 dummies, representing the four quartiles for the variables age, wage and hours.
. todummy age wage hours ,q rlabel("1st Q" "2nd Q" "3rd Q" "4th Q")
Create two dummies, one indicating managers, the second indicating sales.
. todummy occupation ,values(2 \ 3) generate(managers sales)
Create a dummy for each level of race. Dummies names are white, black and other.
. todummy race ,levels
Create two dummies, one indicating whites, the other indicating blacks or others.
. todummy race ,values(1 \ 2 3) generate(white other)
Remarks
Major changes were introduced in version 1.2.0 (21jul2011) of the program. The most important one concerns the handling of missing values. In the current version, missing values in the original variable will, in some cases, be coded 0 in the created dummies. This was not the case in versions prior to 1.2.0. Specify option missing to prevent this behavior if it does not suit your needs. Option noexclude has also changed. The default now is to use only observations not excluded by the if and/or in qualifiers when calculating percentiles and getting the levels of variables. It was the other way round in earlier versions.
Old syntax is still supported if compatible with new functionalities. No longer supported are options binary (introduced in version 1.1.1) and cut(numlist) if numlist contains more than one number. Also, in the current version, at least one option must be specified.
An older version (1.1.2 21may2011) of todummy is available from the author.
Acknowledgments
The programs dummies by Nicholas J. Cox and dummieslab by Philippe Van Kerm and Nick Cox were inspiring. The latter is especially useful to create dummies for each level of the original variable in a more sophisticated way.
Author
Daniel Klein, University of Kassel, klein.daniel.81@gmail.com
Also see
Online: tabulate
if installed: dummies2, dummieslab, dummies