{smcl}
{* 21mar2006/1feb2007/21feb2007/24may2007/9jan2009/8feb2010/29feb2012/14feb2014/20nov2016}{...}
{hline}
help for {cmd:egenmore}
{hline}
{title:Extensions to generate (more extras)}
{p 8 17 2}{cmd:egen}
[{it:type}]
{it:newvar}
{cmd:=}
{it:fcn}{cmd:(}{it:arguments}{cmd:)}
[{cmd:if} {it:exp}]
[{cmd:in} {it:range}]
[{cmd:,} {it:options}]
{title:Description}
{p 4 4 2}
{help egen} creates {it:newvar} of the optionally specified storage type
equal to {it:fcn}{cmd:(}{it:arguments}{cmd:)}. Depending on
{it:fcn}{cmd:()}, {it:arguments} refers to an expression, a
{help varlist}, a {help numlist}, or an empty string. The options are
similarly function dependent.
{title:Functions}
{p 4 4 2}
(The option {cmd:by(}{it:byvarlist}{cmd:)} means that computations are
performed separately for each group defined by {it:byvarlist}.)
{p 4 4 2}
Functions are grouped thematically as follows:{p_end}
{space 8}Grouping and graphing
{space 8}Strings, numbers and conversions
{space 8}Dates, times and time series
{space 8}Summaries and estimates
{space 8}First and last
{space 8}Random numbers
{space 8}Row operations
{title:Grouping and graphing}
{p 4 8 2}
{cmd:axis(}{it:varlist}{cmd:)}
[
{cmd:, gap}
{cmd:label(}{it:lblvarlist}{cmd:)}
{cmdab:miss:ing}
{cmdab:rev:erse}
]
resembles {help egen}'s {cmd:group()}, but is specifically designed for
constructing categorical axis variables for graphs, hence the name. It
creates a single variable taking on values 1, 2, ... for the groups
formed by {it:varlist}. {it:varlist} may contain string, numeric, or
both string and numeric variables. The order of the groups is that of
the sort order of {it:varlist}. {cmd:gap} overrides the default
numbering of 1 up by adding a gap of 1 whenever a variable changes.
{cmd:label()} specifies that labels are to be assigned based on the
value labels or values of {it:lblvarlist}; if not specified,
{it:lblvarlist} defaults to {it:varlist}. {cmd:missing} indicates that
missing values in {it:varlist} (either numeric missing or {cmd:""}) are to be
treated like any other value when assigning groups, instead of missing
values being assigned to the group missing. {cmd:reverse} reverses
labelling so that groups that would have been assigned values of 1 ...
whatever are instead assigned values of whatever ... 1. (Stata 8
required.)
{p 4 4 2}
To order groups of a categorical variable according
to their values of another variable, in preparation
for a graph or table:
{p 4 8 2}{cmd:. egen meanmpg = mean(-mpg), by(rep78)}{p_end}
{p 4 8 2}{cmd:. egen Rep78 = axis(meanmpg rep78), label(rep78)}{p_end}
{p 4 8 2}{cmd:. tabstat mpg, by(Rep78) s(min mean max)}
{p 4 4 2}Note: the function author considers this approach superseded by
his {cmd:seqvar} and {cmd:labmask} (Cox 2008).
{p 4 8 2}
{cmd:clsst(}{it:varname}{cmd:)}
{cmd:,}
{cmdab:v:alues(}{it:numlist}{cmd:)}
[
{cmdab:l:ater}
]
returns whichever of the {it:numlist} in {cmd:values()} is closest
(differs by least, disregarding sign) to the numeric variable
{it:varname}. {cmd:later} specifies that in the event of ties values
specified later in the list overwrite values specified earlier. If
varname is 15 then 10 and 20 specified by {cmd:values(10 20)} are
equally close. For any observation containing 15 the default is that 10
is reported, whereas with {cmd:later} 20 is reported. For a {it:numlist}
containing an increasing sequence, {cmd:later} implies choosing the
higher of two equally close values. (Stata 6 required.)
{p 4 8 2}{cmd:. egen mpgclass = clsst(mpg), v(10(5)40)}
{p 4 8 2}
{cmd:egroup(}{it:varlist}{cmd:)} is a extension of {help egen}'s
{cmd:group()} function with the extra option
{cmd:label(}{it:lblvarlist}{cmd:)}, which will attach the original
values (or value labels if they exist) of {it:lblvarlist} as value
labels. This option may not be combined with the {cmd:label} option.
(Stata 7 required; superseded by {cmd:axis()} above.)
{p 4 8 2}
{cmdab:group2(}{it:varlist}{cmd:)} is a generalisation of
{help egen}'s {cmd:group()} with the extra option
{cmd:sort(}{it:egen_call}{cmd:)}. Groups of {it:varlist} will have
values 1 upwards according to their values on the results of a specified
{it:egen_call}. For example, {cmd:group2(rep78) sort(mean(mpg))} will
produce a variable such that the group of {cmd:rep78} with the lowest
mean of {cmd:mpg} will have value 1, that with the second lowest mean
will have value 2, and so forth. As with {cmd:group()}, the
{cmd:label} option will attach the original values of {it:varlist} (or
value labels if they exist) as value labels. The argument of
{cmd:sort()} must be a valid call to an {cmd:egen} function, official or
otherwise. (Stata 7 required; use of {cmd:egroup()} or
{cmd:axis()} above is now considered better style.)
{p 4 8 2}{cmd:mlabvpos(}{it:yvar xvar}{cmd:)}
[
{cmd:,}
{cmd:log}
{cmdab:poly:nomial(}{it:#}{cmd:)}
{cmdab:mat:rix(}{it:5x5 matrix}{cmd:)}
]
automatically generates a variable giving clock positions of marker labels
given names of variables {it:yvar} and {it:xvar} defining the axes of a scatter
plot. Thus the command generates a variable to be used in the {help scatter}
option {cmd:mlabvpos()}.
{p 8 8 2}
The general idea is to pull marker labels away from the
data region. So, marker labels in the lower left of the
region are at clock positions 7 or 8, and those in the upper right
are at clock-position 1 or 2, etc.
More precisely, considering the following rectangle as the data region,
then marker labels are placed as follows:
{col 9}{c TLC}{hline 14}{c TRC}
{col 9}{c |}11 12 12 12 1{c |}
{col 9}{c |}10 11 12 1 2{c |}
{col 9}{c |} 9 9 12 3 3{c |}
{col 9}{c |} 8 7 6 5 4{c |}
{col 9}{c |} 7 6 6 6 5{c |}
{col 9}{c BLC}{hline 14}{c BRC}
{p 8 8 2}
Note that there is no attempt to prevent marker labels from overplotting,
which is likely in any dataset with many observations. In such situations
you might be better off simply randomizing clock positions with say
{cmd:ceil(uniform() * 12)}.
{p 8 8 2}
If {it:yvar} and {it:xvar} are highly correlated, than the clock-positions are generated
as follows (which is however the same general idea):
{col 9}{c TLC}{hline 14}{c TRC}
{col 9}{c |} 12 1 3{c |}
{col 9}{c |} 12 12 3 4{c |}
{col 9}{c |}11 11 12 5 5{c |}
{col 9}{c |}10 9 6 6 {c |}
{col 9}{c |} 9 7 6 {c |}
{col 9}{c BLC}{hline 14}{c BRC}
{p 8 8 2}
To calculate the positions, the x axis is first categorized into 5 equal
intervals around the mean of {it:xvar}. Afterwards the residuals from regression of
{it:yvar} on {it:xvar} are categorized into 5 equal intervals. Both categorized
variables are then used to calculate the positions according to the first
table above. The rule can be changed with the option {cmd:matrix()}.
{p 8 8 2}
{cmd:log} indicates that residuals from regression are to be calculated
using the logarithms of {it:xvar}. This might be useful if the scatter
shows a strong curvilinear relationship.
{p 8 8 2}
{cmd:polynomial(}{it:#}{cmd:)} indicates that residuals are to be calculated
from a regression of {it:yvar} on a polynomial of {it:xvar}. For example, use
{cmd:poly(2)} if the scatter shows a U-shaped relationship.
{p 8 8 2}
{cmd:matrix(}{it:#}{cmd:)} is used to change the general rule for the plot positions.
The positions are specified by a 5 x 5 matrix, in which cell [1,1] gives the clock
position of marker labels in the upper left part of the data region, and so forth.
(Stata 8.2 required.)
{p 4 8 2}{cmd:. egen clock = mlabvpos(mpg weight)}{p_end}
{p 4 8 2}{cmd:. scatter mpg weight, mlab(make) mlabvpos(clock)}{p_end}
{p 4 8 2}{cmd:. egen clock2 = mlabvpos(mpg weight), matrix(11 1 12 11 1 \\ 10 2 12 10 2 \\ 9 3 12 9 3 \\ 8 4 6 8 4 \\ 7 5 6 7 5)}{p_end}
{p 4 8 2}{cmd:. sc mpg weight, mlab(make) mlabvpos(clock2)}
{title:Strings, numbers and conversions}
{p 4 8 2}
{cmd:base(}{it:varname}{cmd:)}
[
{cmd:,}
{cmdab:b:ase(}{it:#}{cmd:)}
]
produces a string variable containing the digits of a base {it:#}
(default 2, possible values 2(1)9) representation of {it:varname}, which
must contain integers. Thus if {it:varname} contains values 0, 1, 2, 3,
4, and the default base is used, then the result will contain the
strings {cmd:"000"}, {cmd:"001"}, {cmd:"010"}, {cmd:"011"}, {cmd:"100"}.
If any integer values are negative, all string values will start with
{cmd:-} if negative and {cmd:+} otherwise. See also {cmd:decimal()}. The
examples show how to unpack this string into individual digits if
desired. (Stata 6 required.)
{p 4 8 2}{cmd:. egen binary = base(code)}
{p 4 4 2}Suppose {cmd:binary} is {cmd:str5}.
To get individual {cmd:str1} variables,
{p 4 8 2}{cmd:. forval i = 1/5 {c -(}}{p_end}
{p 4 8 2}{cmd:. {space 8}gen str1 code`i' = substr(binary, `i',1)}{p_end}
{p 4 8 2}{cmd:. {c )-}}
{p 4 4 2}and to get individual numeric variables,
{p 4 8 2}{cmd:. forval i = 1/5 {c -(}}{p_end}
{p 4 8 2}{cmd:. {space 8}gen byte code`i' = real(substr(binary, `i', 1))}{p_end}
{p 4 8 2}{cmd:. {c )-}}
{p 4 8 2}
{cmd:decimal(}{it:varlist}{cmd:)}
[
{cmd:,}
{cmdab:b:ase(}{it:#}{cmd:)}
]
treats the values of {it:varlist} as indicating digits in a base {it:#}
(default 2, possible values integers >=2) representation of a number and
produces the decimal equivalent. Thus if three variables are given with
values in a single observation of 1 1 0, and the default base is used,
the decimal result is 1 * 2^2 + 1 * 2^1 + 0 * 2^0 = 4 + 2 + 0 = 6.
Similarly if base 5 is used, the decimal equivalent of 2 3 4 is 2 * 5^2
+ 3 * 5^1 + 4 * 5^0 = 50 + 15 + 4 = 59. Note that the order of variables
in {it:varlist} is crucial. (Stata 7 required.)
{p 4 8 2}{cmd:. egen decimal = decimal(q1-q8)}
{p 4 8 2}
{cmd:incss(}{it:strvarlist}{cmd:)}
{cmd:,}
{cmdab:s:ubstr(}{it:substring}{cmd:)}
[
{cmdab:i:nsensitive}
]
indicates occurrences of {it:substring} within any of the variables in a list
of string variables by 1 and other observations by 0. {cmd:insensitive}
makes comparison case-insensitive. (Stata 6 required; an alternative is
now just to use {help foreach}.)
{p 4 8 2}{cmd:. egen buick = incss(make), sub(buick) i}
{p 4 8 2}
{cmd:iso3166(}{it:varname}{cmd:)}
[{cmd:,}
{cmdab:o:rigin(}{cmd:codes}|{cmd:names}{cmd:)}
{cmdab:l:anguage(}{cmd:en}|{cmd:fr}{cmd:)}
{cmdab:v:erbose}
{cmdab:u:pdate}]
maps {it:varname} containing "official short country names" into
a new variable containing the ISO 3166-1-alpha-2 code elements
(e.g. DE for "Germany", GB for "United Kingdom" and HM for "Heard
Island and McDonald Islands") and vice versa. The official short
country names can be in English (default) or French. Correspondingly
the function produces country names from ISO 3166-1-alpha-2 codes in
English or French. (Version 9.2 required.)
{p 8 8 2}{cmdab:o:rigin(}{cmd:codes}|{cmd:names}{cmd:)} declares the
character of
the country variable that is already in the data. The default is
{cmd:names}, meaning that {it:varname} holds the "official short country
names". This information may be stored as a string variable or as a
numeric variable that is labeled accordingly. This default setting
produces ISO 3166-1-alpha-2 codes from the country names. If country
names should be produced from the two letter codes, use
{cmd:egen} {it:newvar} {cmd:= iso3166(}{it:varname}{cmd:), origin(codes)}.
{p 8 8 2}{cmdab:l:anguage(}{cmd:en}|{cmd:fr}{cmd:)} defines the language in
which the country names are stored, or should be
produced. {cmd:language(en)} is for English names (default);
{cmd:language(fr)} is for French names.
{p 8 8 2}{cmdab:v:erbose} For the mapping from country names to
ISO 3166-1-alpha2 codes the program expects official short
country names. It cannot handle unofficial country names such as
"Great Britain", "Taiwan" or "Russia". Such unofficial country names
result in the generation of missing values for the respective
countries. By default {cmd:iso3166()} only returns the number of
missing values it has produced. With {cmd:verbose} Stata also provides
the list of unofficial country names in {it:varname} and a clickable
link to the list of official country names. This is convenient
if one wants to correct the information stored in {it:varname} before
using {cmd:iso3166()}. For the transformation of ISO 3166-1-alpha2
codes into country names, {cmd:verbose} does something
equivalent.
{p 8 8 2}{cmdab:u:pdate} The ISO 3166-1-alpha2 codes are automaticaly
looked up in information provided by the ISO 3166 Maintenance
Agency of the International Organization for Standardization. The
information is automatically downloaded from the internet when the user
specifies {cmd:iso3166()} the first time, or
whenever {cmd:update} is specified. Note: Updating the matching list
regularly will guarantee that {cmd:iso3166()} always produces
up-to-date country names. However, updating the match list may also
produce missing values when running older do-files for data sets with
countries that no longer exist (for example, Yugoslavia).
{p 8 8 2}Note the implications: This function will only work if your copy of
Stata can access the internet, at least for the first time it is called.
The results of the function might be not fully reproducible in the
future.
{p 4 8 2}
{cmd:msub(}{it:strvar}{cmd:)}
{cmd:,}
{cmdab:f:ind(}{it:findstr}{cmd:)}
[
{cmdab:r:eplace(}{it:replacestr}{cmd:)}
{cmd:n(}{it:#}{cmd:)}
{cmdab:w:ord}
]
replaces occurrences of the words of {it:findstr} by the words of
{it:replacestr} in the string variable {it:strvar}. The words of
{it:findstr} and of {it:replacestr} are separated by spaces or bound by
{cmd:" "}: thus {cmd:find(a b "c d")} includes three words, in turn
{cmd:"a"}, {cmd:"b"} and {cmd:"c d"}, and double quotation marks
{cmd:" "} should be used to delimit any word including one or more spaces. The
number of words in {it:findstr} should equal that in {it:replacestr},
except that (1) an empty {it:replacestr} is taken to specify deletion;
(2) a single word in {it:replacestr} is taken to mean that each word of
{it:findstr} is to be replaced by that word. As quotation marks are used
for delimiting, literal quotation marks should be included in
compound double quotation marks, as in {cmd:`"""'}. By default all occurrences
are changed. {cmd:n(}{it:#}{cmd:)} specifies that the first {it:#}
occurrences only should be changed. {cmd:word} specifies that words in
{it:findstr} are to be replaced only if they occur as separate words in
{it:strvar}. The substitutions of {cmd:msub()} are made in sequence.
(Stata 6 required; {cmd:msub()} depends on the built-in functions
{help subinstr()} and {help subinword()}.)
{p 4 8 2}{cmd:. egen newstr = msub(strvar), f(A B C) r(1 2 3)}{p_end}
{p 4 4 2}(replaces {cmd:"A"} by {cmd:"1"}, {cmd:"B"} by {cmd:"2"}, {cmd:"C"} by {cmd:"3"})
{p 4 8 2}{cmd:. egen newstr = msub(strvar), f(A B C) r(1 2 3) n(1)}{p_end}
{p 4 4 2}(replaces {cmd:"A"} by {cmd:"1"}, {cmd:"B"} by {cmd:"2"}, {cmd:"C"} by {cmd:"3"}, first occurrence only)
{p 4 8 2}{cmd:. egen newstr = msub(strvar), f(A B C) r(1)}{p_end}
{p 4 4 2}(replaces {cmd:"A"} by {cmd:"1"}, {cmd:"B"} by {cmd:"1"}, {cmd:"C"} by {cmd:"1"})
{p 4 8 2}{cmd:. egen newstr = msub(strvar), f(A B C)}{p_end}
{p 4 4 2}(deletes {cmd:"A"}, {cmd:"B"}, {cmd:"C"})
{p 4 8 2}{cmd:. egen newstr = msub(strvar), f(" ")}{p_end}
{p 4 4 2}(deletes spaces)
{p 4 8 2}{cmd:. egen newstr = msub(strvar), f(`"""')}{p_end}
{p 4 4 2}(deletes quotation mark {cmd:"})
{p 4 8 2}{cmd:. egen newstr = msub(strvar) f(frog) w}{p_end}
{p 4 4 2}(deletes {cmd:"frog"} only if occurring as single word)
{p 4 8 2}
{cmd:noccur(}{it:strvar}{cmd:)}
{cmd:,}
{cmdab:s:tring(}{it:substr}{cmd:)}
creates a variable containing the number of occurrences of the string
{it:substr} in string variable {it:strvar}. Note that occurrences must
be disjoint (non-overlapping): thus there are two occurrences of
{cmd:"aa"} within {cmd:"aaaaa"}. (Stata 7 required.)
{p 4 8 2}
{cmd:nss(}{it:strvar}{cmd:)}
{cmd:,}
{cmdab:f:ind(}{it:substr}{cmd:)}
[
{cmdab:i:nsensitive}
]
returns the number
of occurrences of {it:substr} within the string variable {it:strvar}.
{cmd:insensitive} makes counting case-insensitive. (Stata 6 required.)
{p 4 4 2}The inclusion of {cmd:noccur()} and {cmd:nss()}, two almost
identical functions, was an act of sheer inadvertence by the maintainer.
{p 4 8 2}
{cmd:ntos(}{it:numvar}{cmd:)}
{cmd:,}
{cmdab:f:rom(}{it:numlist}{cmd:)}
{cmdab:t:o(}{it:list of string values}{cmd:)}
generates a string variable from a numeric variable {it:numvar}, mapping
each numeric value in {it:numlist} to the corresponding string value.
The number of elements in each list must be the same. String values
containing blanks should be delimited by doube quotation marks
{cmd:" "}. Values not defined by the mapping are generated as missing. The type
of the string variable is determined automatically. (Stata 6 required.)
{p 4 8 2}{cmd:. egen grade = ntos(Grade), from(1/5) to(Poor Fair Good "Very good" Excellent)}
{p 4 8 2}
{cmd:nwords(}{it:strvar}{cmd:)} returns the number of words within the string
variable {it:strvar}. Words are separated by spaces, unless bound by double
quotation marks {cmd:" "}. (Stata 6 required; superseded by
{help wordcount()}).
{p 4 8 2}
{cmd:repeat()}
{cmd:,}
{cmdab:v:alues(}{it:value_list}{cmd:)}
[
{cmd:by(}{it:byvarlist}{cmd:)}
{cmdab:b:lock(}{it:#}{cmd:)}
]
produces a repeated sequence of {it:value_list}. The items of {it:value_list},
which may be a {it:numlist} or a set of string values, are assigned cyclically to
successive observations. The order of observations is determined (1) after
noting any {cmd:if} or {cmd:in} restrictions; (2) within groups specified by
{cmd:by()}, if issued; (3) by the current sort order. {cmd:block()} specifies
that values should be repeated in blocks of the specified size: the default is
1. The variable type is determined smartly, and need not be specified. (Stata 8
required.)
{p 4 8 2}{cmd:. egen quarter = repeat(), v(1/4) block(3)}{p_end}
{p 4 8 2}{cmd:. egen months = repeat(), v(`c(Months)')}{p_end}
{p 4 8 2}{cmd:. egen levels = repeat(), v(10 50 200 500)}
{p 4 8 2}
{cmd:sieve(}{it:strvar}{cmd:)}
{cmd:,}
{c -(}
{cmd:keep(}{it:classes}{cmd:)}
{c |}
{cmd:char(}{it:chars}{cmd:)}
{c |}
{cmd:omit(}{it:chars}{cmd:)}
{c )-}
selects characters from {it:strvar} according to a specified criterion and
generates a new string variable containing only those characters. This may be
done in three ways. First, characters are classified using the keywords
{cmd:alphabetic} (any of {cmd:a-z} or {cmd:A-Z}), {cmd:numeric} (any of
{cmd:0-9}), {cmd:space} or {cmd:other}. {cmd:keep()} specifies one or more of
those classes: keywords may be abbreviated by as little as one letter. Thus
{cmd:keep(a n)} selects alphabetic and numeric characters and omits spaces and
other characters. Note that keywords must be separated by spaces.
Alternatively, {cmd:char()} specifies each character to be selected or
{cmd:omit()} specifies each character to be omitted. Thus
{cmd:char(0123456789.)} selects numeric characters and the stop (presumably as
decimal point); {cmd:omit(" ")} strips spaces and {cmd:omit(`"""')} strips
double quotation marks. (Stata 7 required.)
{p 4 8 2}
{cmd:ston(}{it:strvar}{cmd:)}
{cmd:,}
{cmdab:f:rom(}{it:list of string values}{cmd:)}
{cmdab:t:o(}{it:numlist}{cmd:)}
generates a numeric variable from a string variable {it:strvar}, mapping each
string value to the corresponding numeric value in {it:numlist}. The number of
elements in each list must be the same. String values containing blanks should
be delimited by {cmd:" "}. Values not defined by the mapping are generated as
missing. (Stata 6 required.)
{p 4 8 2}{cmd:. egen Grade = ston(grade), to(1/5) from(Poor Fair Good "Very good" Excellent)}
{p 4 8 2}
{cmd:truncdig(}{it:varname}{cmd:), dig(}{it:#}{cmd:)}
truncates a numeric variable at the specified number of decimal digits.
It applies the {cmd:trunc()} or {cmd:int()} function to the variable
times 10^{cmd:dig}, then divides by 10^{cmd:dig}. The {cmd:dig()}
argument may be positive, zero or negative. If negative, it creates a
binned variable: for instance, with income in dollars,
{cmd:egen inck = truncdig(income), dig(-3)} creates a measure of income
expressed in whole thousands of dollars. (Stata 12 required.)
{p 4 8 2}
{cmd:wordof(}{it:strvar}{cmd:)}
{cmd:,}
{cmdab:w:ord(}{it:#}{cmd:)} returns the {it:#}th word of string variable
{it:strvar}. {cmd:word(1)} is the first word, {cmd:word(2)} the second word,
{cmd:word(-1)} the last word, and so forth. Words are separated by spaces,
unless bound by quotation marks {cmd:" "}. (Stata 6 required; superseded
by {help word()}.)
{title:Dates, times and time series}
{p 4 8 2}
{cmd:bom(}{it:m y}{cmd:)}
[
{cmd:,}
{cmdab:l:ag(}{it:lag}{cmd:)}
{cmdab:f:ormat(}{it:format}{cmd:)}
{cmdab:w:ork}
]
creates an elapsed date variable containing the date of the beginning of
month {it:m} and year {it:y}. {it:m} can be a variable containing
integers between 1 and 12 inclusive or a single integer in that range.
{it:y} can be a variable containing integers within the range covered by
elapsed dates or a single integer within that range. Optionally
{cmd:lag()} specifies a lag: the beginning of the month will be given
for {cmd:lag} months before the current date. {cmd:lag(1)} refers to the
previous month, {cmd:lag(3)} to 3 months ago and {cmd:lag(-3)} to 3
months hence. The {cmd:lag} may also be specified by a variable
containing integers. Optionally a format, usually but not necessarily a
date format, can be specified. {cmd:work} specifies that the first day
must also be one of Monday to Friday. (Stata 6 required.)
{p 4 8 2}{cmd:. egen bom = bom(month year), f(%dd_m_y)}{p_end}
{p 4 8 2}
{cmd:bomd(}{it:datevar}{cmd:)}
[
{cmd:,}
{cmdab:l:ag}{cmd:(}{it:lag}{cmd:)}
{cmdab:f:ormat}{cmd:(}{it:format}{cmd:)}
{cmdab:w:ork}
]
creates an elapsed date variable containing the date of the beginning of
the month containing the date in an elapsed date variable {it:datevar}.
Optionally {cmd:lag()} specifies a lag: the beginning of the month will
be given for {cmd:lag} months before the current date. {cmd:lag(1)}
refers to the previous month, {cmd:lag(3)} to 3 months ago and
{cmd:lag(-3)} to 3 months hence. The {cmd:lag} may also be specified by
a variable containing integers. Optionally a format, usually but not
necessarily a date format, can be specified. {cmd:work} specifies that
the first day must also be one of Monday to Friday. (Stata 6 required.)
{p 4 8 2}{cmd:. egen bomd = bomd(date), f(%dd_m_y)}
{p 4 4 2}
Note that {cmd:work} knows nothing about holidays or any special days.
{p 4 8 2}
{cmd:dayofyear(}{it:daily_date_variable}{cmd:)}
[
{cmd:,}
{cmdab:m:onth(}{it:#}{cmd:)}
{cmdab:d:ay(}{it:#}{cmd:)}
]
generates the day of the year, counting from the
start of the year, from a daily date variable. The
start of the year is 1 January by default: {cmd:month()}
and/or {cmd:day()} may be used to specify an alternative.
This function thus is a generalisation of the date
function {help doy()}.
(Stata 8 required.)
{p 4 8 2}{cmd:. egen dayofyear = dayofyear(date), m(10)}
{p 4 8 2}
{cmd:dhms(}{it:d h m s}{cmd:)}
[
{cmd:,}
{cmdab:f:ormat(}{it:format}{cmd:)}
]
creates a date variable from Stata date variable or date {it:d} with a
fractional part reflecting the number of hours, minutes and seconds past
midnight. {it:h} can be a variable containing integers between 0 and 23
inclusive or a single integer in that range. {it:m} and {it:s} can be
variables containing integers between 0 and 59 or single integer(s) in
that range. Optionally a format, usually but not necessarily a date
format, can be specified. The resulting variable, which is by default
stored as a double, may be used in date and time arithmetic in which the
time of day is taken into account. (Stata 6 required.)
{p 4 8 2}
{cmd:elap(}{it:time}{cmd:)}
[
{cmd:,}
{cmdab:f:ormat(}{it:format}{cmd:)}
]
creates a string variable which contains the number of days, hours,
minutes and seconds associated with an integer variable containing a
number of elapsed seconds. Such a variable might be the result of
date/time arithmetic, where a time interval between two timestamps has
been expressed in terms of elapsed seconds. Leading zeroes are included
in the hours, minutes, and seconds fields. Optionally, a format can be
specified. (Stata 6 required.)
{p 4 8 2}
{cmd:elap2(}{it:time1 time2}{cmd:)}
[
{cmd:,}
{cmdab:f:ormat(}{it:format}{cmd:)}
]
creates a string variable which contains the number of days, hours,
minutes and seconds associated with a pair of time values, expressed as
fractional days, where {it:time1} is no greater than {it:time2}. Such
time values may be generated by function {cmd:dhms()}. {cmd:elap2()}
expresses the interval between these time values in readable form.
Leading zeroes are included in the hours, minutes, and seconds fields.
Optionally, a format can be specified. (Stata 6 required.)
{p 4 8 2}
{cmd:eom(}{it:m y}{cmd:)}
[
{cmd:,}
{cmdab:l:ag(}{it:lag}{cmd:)}
{cmdab:f:ormat(}{it:format}{cmd:)}
{cmdab:w:ork}
]
creates an elapsed date variable containing the date of the end of month
{it:m} and year {it:y}. {it:m} can be a variable containing integers between 1 and 12
inclusive or a single integer in that range. {it:y} can be a variable
containing integers within the range covered by elapsed dates or a
single integer within that range. Optionally {cmd:lag()} specifies a
lag: the end of the month will be given for {cmd:lag} months before the
current date. {cmd:lag(1)} refers to the previous month, {cmd:lag(3)} to
3 months ago and {cmd:lag(-3)} to 3 months hence. The {cmd:lag} may also
be specified by a variable containing integers. Optionally a format,
usually but not necessarily a date format, can be specified. {cmd:work}
specifies that the last day must also be one of Monday to Friday.
(Stata 6 required.)
{p 4 8 2}{cmd:. egen eom = eom(month year), f(%dd_m_y)}
{p 4 8 2}
{cmd:eomd(}{it:datevar}{cmd:)}
[
{cmd:,}
{cmdab:l:ag(}{it:lag}{cmd:)}
{cmdab:f:ormat(}{it:format}{cmd:)}
{cmdab:w:ork}
]
creates an elapsed date variable containing the date of the end of the
month containing the date in an elapsed date variable {it:datevar}.
Optionally {cmd:lag()} specifies a lag: the end of the month will be
given for {cmd:lag} months before the current date. {cmd:lag(1)} refers
to the previous month, {cmd:lag(3)} to 3 months ago and {cmd:lag(-3)} to
3 months hence. The {cmd:lag} may also be specified by a variable
containing integers. Optionally a format, usually but not necessarily a
date format, can be specified. {cmd:work} specifies that the last day
must also be one of Monday to Friday. (Stata 6 required.)
{p 4 4 2}Note that {cmd:work} knows nothing about holidays
or any special days.
{p 4 8 2}{cmd:. egen eom = eomd(date), f(%dd_m_y)}{p_end}
{p 4 8 2}{cmd:. egen eopm = eomd(date), f(%dd_m_y) lag(1)}
{p 4 8 2}
{cmd:ewma(}{it:timeseriesvar}{cmd:)}
{cmd:,}
{cmd:a(}{it:#}{cmd:)}
calculates the exponentially weighted moving average, which is
{p 8 8 2}
{it:ewma} = {it:timeseriesvar} for the first observation
{p 13 8 2}
= {cmd:a * }{it:timeseriesvar} + {cmd:(1 - a) * L.}{it:ewma} otherwise
{p 8 8 2}
The data must have been declared time series data by {help tsset}.
Calculations start afresh after any gap with missing values.
(Stata 6 required; superseded by {help tssmooth}.)
{p 4 8 2}
{cmd:filter(}{it:timeseriesvar}{cmd:) ,}
{cmdab:l:ags(}{it:numlist}{cmd:)}
[
{cmdab:c:oef(}{it:numlist}{cmd:)}
{c -(}
{cmdab:n:ormalise}
{c |}
{cmdab:n:ormalize}
{c )-}
]
calculates the linear filter which is the sum of terms
{p 8 8 2}
{it:coef_i} {cmd:* L}{it:i.timeseriesvar} or {it:coef_i} {cmd:* F}{it:i.timeseriesvar}
{p 8 8 2}
{cmd:coef()} defaults to a vector the same length as {cmd:lags()} with each
element 1.
{p 8 8 2}
{cmd:filter(y), l(0/3) c(0.4(0.1)0.1)} calculates
{p 8 8 2}
{cmd:0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y}
{p 8 8 2}
{cmd:filter(y), l(0/3)} calculates
{p 8 8 2}
{cmd:1 * y + 1 * L1.y + 1 * L2.y + 1 * L3.y} or {cmd:y + L1.y + L2.y + L3.y}
{p 8 8 2}
Leads are specified as negative lags. {cmd:normalise} (or {cmd:normalize},
according to taste) specifies that coefficients are to be divided by
their sum so that they add to 1 and thus specify a weighted mean.
{p 8 8 2}
{cmd:filter(y), l(-2/2) c(1 4 6 4 1) n} calculates
{p 8 8 2}
{cmd:(1/16) * F2.y + (4/16) * F1.y + (6/16) * y}
{cmd:+ (4/16) * L1.y + (1/16) * L2.y}
{p 8 8 2}
The data must have been declared time series data by {help tsset}.
Note that this may include panel data, which are automatically
filtered separately within each panel.
{p 8 8 2}
The order of terms in {cmd:coef()} is taken to be the same as that in
{cmd:lags}. (Stata 8 required; see also {help tssmooth}.)
{p 4 8 2}{cmd:. egen f2y = filter(y), l(-1/1) c(0.25 0.5 0.25)}{p_end}
{p 4 8 2}{cmd:. egen f2y = filter(y), l(-1/1) c(1 2 1) n}
{p 4 8 2}
{cmd:filter7(}{it:timeseriesvar}{cmd:) ,}
{cmdab:l:ags(}{it:numlist}{cmd:)}
{cmdab:c:oef(}{it:numlist}{cmd:)}
[
{c -(}
{cmdab:n:ormalise}
{c |}
{cmdab:n:ormalize}
{c )-}
]
calculates the linear filter which is the sum of terms
{p 8 8 2}
{it:coef_i} {cmd:* L}{it:i.timeseriesvar} or {it:coef_i }{cmd:* F}{it:i.timeseriesvar}
{p 8 8 2}
{cmd:filter7(y), l(0/3) c(0.4(0.1)0.1)} calculates
{p 8 8 2}
{cmd:0.4 * y + 0.3 * L1.y + 0.2 * L2.y + 0.1 * L3.y}
{p 8 8 2}
Leads are specified as negative lags. {cmd:normalise} (or {cmd:normalize},
according to taste) specifies that coefficients are to be divided by
their sum so that they add to 1 and thus specify a weighted mean.
{p 8 8 2}
{cmd:filter7(y), l(-2/2) c(1 4 6 4 1) n} calculates
{p 8 8 2}
{cmd:(1/16) * F2.y + (4/16) * F1.y + (6/16) * y}
{cmd:+ (4/16) * L1.y + (1/16) * L2.y}
{p 8 8 2}
The data must have been declared time series data by {help tsset}.
Note that this may include panel data, which are automatically
filtered separately within each panel.
{p 8 8 2}
The order of terms in {cmd:coef()} is taken to be the same as that in
{cmd:lags()}. (Stata 7 required; see also {help tssmooth}.)
{p 4 8 2}
{cmd:foy(}{it:daily_date_variable}{cmd:)}
[
{cmd:,}
{cmdab:m:onth(}{it:#}{cmd:)}
{cmdab:d:ay(}{it:#}{cmd:)}
]
generates the fraction of the year elapsed since the
start of the year from a daily date variable. The
start of the year is 1 January by default: {cmd:month()}
and/or {cmd:day()} may be used to specify an alternative.
If {it:daily_date_variable}
is all integers, then the result is {bind:(day of year - 0.5)} /
number of days in year. If {it:daily_date_variable}
contains non-integers, then the result is
{bind:(day of year - 1)} / number of days in year.
(Stata 8 required.)
{p 4 8 2}{cmd:. egen frac = foy(date), m(10)}
{p 4 8 2}
{cmd:hmm(}{it:timevar}{cmd:)}
[
{cmd:,}
{cmdab:r:ound(}{it:#}{cmd:)}
{cmdab:t:rim}
]
generates a string variable showing {it:timevar}, interpreted as
indicating time in minutes, represented as hours and minutes in the form
{cmd:"}[...{it:h}]{it:h}{cmd::}{it:mm}{cmd:"}. For example, times of
{cmd:9}, {cmd:90}, {cmd:900} and {cmd:9000} minutes would be represented
as {cmd:"0:09"},{cmd:"1:30"}, {cmd:"15:00"} and {cmd:"150:00"}. The
option {cmd:round(}{it:#}{cmd:)} rounds the result: {cmd:round(1)}
rounds the time to the nearest minute. The option {cmd:trim} trims the
result of leading zeros and colons, except that an isolated {cmd:0} is
not trimmed. With {cmd:trim} {cmd:"0:09"} is trimmed to {cmd:"9"} and
{cmd:"0:00"} is trimmed to {cmd:"0"}.
{p 8 8 2}
{cmd:hmm()} serves equally well for representing times in seconds in
minutes and seconds in the form
{cmd:"}[...{it:m}]{it:m}{cmd::}{it:ss}{cmd:"}. (Stata 6 required.)
{p 4 8 2}
{cmd:hmmss(}{it:timevar}{cmd:)}
[
{cmd:,}
{cmdab:r:ound(}{it:#}{cmd:)}
{cmdab:t:rim}
]
generates a string variable showing {it:timevar}, interpreted as
indicating time in seconds, represented as hours, minutes and seconds in
the form {cmd:"}[...{it:h}{cmd::}]{it:mm}{cmd::}{it:ss}{cmd:"}. For
example, times of {cmd:9}, {cmd:90}, {cmd:900} and {cmd:9000} seconds
would be represented as {cmd:"00:09"},{cmd:"01:30"}, {cmd:"15:00"} and
{cmd:"2:30:00"}. The option {cmd:round(}{it:#}{cmd:)} rounds the result:
{cmd:round(1)} rounds the time to the nearest second. The option
{cmd:trim} trims the result of leading zeros and colons, except that an
isolated {cmd:0} is not trimmed. With {cmd:trim} {cmd:"00:09"} is
trimmed to {cmd:"9"} and {cmd:"00:00"} is trimmed to {cmd:"0"}. (Stata 6
required.)
{p 4 8 2}
{cmd:hms(}{it:h m s}{cmd:)}
[
{cmd:,}
{cmdab:f:ormat(}{it:format}{cmd:)}
]
creates an elapsed time variable containing the number of seconds past
midnight. {it:h} can be a variable containing integers between 0 and 23
inclusive or a single integer in that range. {it:m} and {it:s} can be variables
containing integers between 0 and 59 or single integer(s) in that range.
Optionally a format can be specified. (Stata 6 required.)
{p 4 8 2}
{cmd:minutes(}{it:strvar}{cmd:)}
[
{cmd:,}
{cmd:maxhour(}{it:#}{cmd:)}
]
returns time in minutes given a string variable {it:strvar} containing a
time in hours and minutes in the form
{cmd:"}[..{it:h}]{it:hh}:{it:mm}{cmd:"}. In particular, minutes are
given as two digits between 00 and 59 and hours by default are given as
two digits between 00 and 23. The {cmd:maxhour()} option may be used to
change the (unreachable) limit: its default is 24. Note that, strange
though it may seem, this function rather than {cmd:seconds()} is
appropriate for converting times in the form
{cmd:"}{it:mm}:{it:ss}{cmd:"} to seconds. The maximum number of minutes
acceptable may need then to be specified by {cmd:maxhour()} [sic].
(Stata 8 required.)
{p 4 8 2}
{cmd:ncyear(}{it:datevar}{cmd:)}
{cmd:,}
{cmdab:m:onth(}{it:#}{cmd:)}
[
{cmdab:d:ay(}{it:#}{cmd:)}
]
returns an integer variable labelled with labels such as {cmd:"1952/53"}
for non-calendar years starting on the specified month and day. The day
defaults to 1. {it:datevar} is treated as indicating elapsed dates. For
more on dates, see help on {help dates}. (Stata 6 required.)
{p 4 8 2}{cmd:. egen wtryear = ncyear(date), m(10)}{p_end}
{p 4 4 2}(years starting on 1 October)
{p 4 8 2}{cmd:. egen wwgyear = ncyear(date), m(1) d(21)}{p_end}
{p 4 4 2}(years starting on 21 January)
{p 4 8 2}
{cmd:record(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
{cmd:min}
{cmd:order(}{it:varlist}{cmd:)}
]
produces the maximum (with {cmd:min} the minimum) value observed "to date" of
the specified {it:exp}. Thus {cmd:record(wage), by(id) order(year)} produces
the maximum wage so far in a worker's career, calculations being separate for
each {cmd:id} and records being determined within each {cmd:id} in {cmd:year}
order. Although explanation and example here refer to dates, nothing in
{cmd:record()} restricts its use to data ordered in time. If not otherwise
specified with {cmd:by()} and/or {cmd:order()}, records are determined with
respect to the current order of observations. No special action is required for
missing values, as internally {cmd:record()} uses either the {cmd:max()} or the
{cmd:min()} function, both of which return results of missing only if all
values are missing. (Stata 6 required.)
{p 4 8 2}{cmd:. egen hiwage = record(exp(lwage)), by(id) order(year)}{p_end}
{p 4 8 2}{cmd:. egen lowage = record(exp(lwage)), by(id) order(year) min}
{p 4 8 2}
{cmd:seconds(}{it:strvar}{cmd:)}
[
{cmd:,}
{cmd:maxhour(}{it:#}{cmd:)}
]
returns time in seconds given a string variable containing a time in hours,
minutes and seconds in the form
{cmd:"}[..{it:h}]{it:hh}{cmd::}{it:mm}{cmd::}{it:ss}{cmd:"}.
In particular, minutes and seconds are each given as two digits between
00 and 59 and hours by default are given as two digits between 00 and
23. The {cmd:maxhour()} option may be used to change the (unreachable)
limit: its default is 24. (Stata 8 required.)
{p 4 8 2}
{cmd:tod(}{it:time}{cmd:)}
[
{cmd:,}
{cmdab:f:ormat(}{it:format}{cmd:)}
]
creates a string variable which contains the number of hours, minutes and
seconds associated with an integer in the range 0 to 86399, one less than the
number of seconds in a day. Such a variable is produced by {cmd:hms()}, which
see above. Leading zeroes are included in the hours, minutes, and seconds
fields. Colons are used as separators. Optionally a format can be specified.
(Stata 6 required.)
{title:Summaries and estimates}
{p 4 8 2}
{cmd:adjl(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
{cmdab:fact:or(}{it:#}{cmd:)}
]
calculates adjacent lower values. These are the smallest values within
{cmd:factor()} times the interquartile range of the lower quartile.
By default {cmd:factor()} is 1.5, defining the default lower value
of a so-called whisker on a Stata box plot. (Stata 8 required.)
{p 4 8 2}
{cmd:adju(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
{cmdab:fact:or(}{it:#}{cmd:)}
]
calculates adjacent upper values. These are the largest values within
{cmd:factor()} times the interquartile range of the upper quartile.
By default {cmd:factor()} is 1.5, defining the default upper value
of a so-called whisker on a Stata box plot. (Stata 8 required.)
{p 4 8 2}{cmd:. egen adjl = adjl(mpg), by(foreign)}{p_end}
{p 4 8 2}{cmd:. egen adju = adju(mpg), by(foreign)}
{p 4 8 2}
{cmd:corr(}{it:varname1 varname2}{cmd:)}
[
{cmd:,}
{cmdab:c:ovariance}
{cmdab:s:pearman}
{cmd:taua}
{cmd:taub}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the correlation of {it:varname1} with {it:varname2}. By
default, this returns the Pearson correlation coefficient. {cmd:covariance}
indicates that covariances should be calculated; {cmd:spearman}
indicates that Spearman's rank correlation coefficient should be
calculated; {cmd:taua} and {cmd:taub} return Kendall's tau-A and tau-B,
respectively. (Stata 8 required.)
{p 4 8 2}
{cmd:d2(}{it:exp}{cmd:)}
[
{cmd:,}
{cmdab:w:eights(}{it:exp}{cmd:)}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the mean absolute deviation from the median (within varlist) of {it:exp},
allowing specification of weights. The function creates a constant (within {it:byvarlist})
containing the mean of abs({it:exp} - median({it:exp})). (Stata 10.1 required.)
{p 4 8 2}
{cmd:density(}{it:varname}{cmd:)}
[
{cmd:,}
{cmdab:w:idth(}{it:#}{cmd:)}
{cmdab:st:art(}{it:#}{cmd:)}
{cmdab:freq:uency}
{cmd:percent}
{cmdab:frac:tion}
{cmd:by(}{it:byvarlist}{cmd:)}
]
calculates the density (or optionally the {cmd:frequency},
{cmd:fraction} or {cmd:percent}) of values in bins of width
{cmd:width()} (default 1) starting at {cmd:start()} (default minimum of
the data). Note that each value produced will be identical for all
observations in the same bin. Commonly for further use it will be
desired to select one value from each bin, say by using {help egen}'s
{cmd:tag()} function. (Stata 8 required.)
{p 4 8 2}
{cmd:gmean(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the geometric mean of {it:exp}. (Stata 6 required.)
{p 4 8 2}{cmd:. egen gmean = gmean(mpg), by(rep78)}
{p 4 8 2}
{cmd:hmean(}{it:exp}{cmd:)}
[
{cmd:, by(}{it:byvarlist}{cmd:)}
]
returns the harmonic mean of {it:exp}. (Stata 6 required.)
{p 4 8 2}{cmd:. egen hmean = hmean(mpg), by(rep78)}
{p 4 8 2}
{cmd:nmiss(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the number of missing values in {it:exp}. (Stata 6 required.)
Remark: Why this was written is a mystery. The one-line command
{cmd:egen nmiss = sum(missing(}{it:exp}{cmd:)}
(in Stata 9 {cmd:egen nmiss = total(missing(}{it:exp}{cmd:)})
shows that it is unnecessary.
{p 4 8 2}{cmd:. egen nmiss = nmiss(rep78), by(foreign)}
{p 4 8 2}
{cmd:nvals(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
{cmdab:miss:ing}
]
returns the number of distinct values in {it:varname}. Missing values
are ignored unless {cmd:missing} is specified.
Remark: Much can be done by using {help egen} function {cmd:tag()}
and then summing values as desired. See also {cmd:distinct} (Cox and Longton 2008).
(Stata 6 required.)
{p 4 8 2}
{cmd:outside(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
{cmdab:fact:or(}{it:#}{cmd:)}
]
calculates outside values. These are any values more than {cmd:factor()}
times the interquartile range from the nearer quartile, that is above
the upper quartile or below the lower quartile. By default
{cmd:factor()} is 1.5, defining the default outside values, those
plotted separately, on a Stata box plot.
Values not outside are returned as missing.
(Stata 8 required.)
{p 4 8 2}
{cmd:ridit(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
{cmdab:miss:ing}
{cmdab:perc:ent}
{cmdab:rev:erse}
]
calculates the ridit for {it:varname}, which is
{space 8}(1/2) count at this value + SUM counts in values below
{space 8}{hline 54}
{space 23}SUM counts of all values
{p 8 8 2}
With terminology from Tukey (1977, pp.496-497), this could be called a
`split fraction below'. The name `ridit' was used by Bross (1958):
see also Fleiss (1981, pp.150-7) or Flora (1988). The numerator is a
`split count'.
{p 8 8 2}
{cmd:missing} specifies that observations for which values of {it:byvarlist}
are missing will be included in calculations if {cmd:by()} is specified. The
default is to exclude them. {cmd:percent} scales the numbers to percents by
multiplying by 100. {cmd:reverse} specifies the use of reverse cumulative
probabilities (1 - fraction above). (Stata 6 required.)
{p 4 8 2}
{cmd:semean(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
calculates the standard error of the mean of {it:exp}. (Stata 6
required.)
{p 4 8 2}
{cmd:sumoth(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the sum of the other values of {it:exp} in the same group. If
{cmd:by()} is specified, distinct combinations of {it:byvarlist} define groups;
otherwise all observations define one group. (Stata 6 required.)
{p 4 8 2}
{cmd:var(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
creates a constant (within {it:byvarlist}) containing the variance of {it:exp}.
Note also the {help egen} function {cmd:sd()}. (Stata 6 required.)
{p 4 8 2}
{cmd:wpctile(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:p(}{it:#}{cmd:)}
{cmdab:w:eights(}{it:varname}{cmd:)}
{cmdab:alt:def}
{cmd:by(}{it:byvarlist}{cmd:)}
]
is a hack on official Stata's {cmd:egen} function {cmd:pctile()}
allowing specification of weights in the calculation of percentiles. By
default, the function creates a constant (within {it:byvarlist})
containing the {it:#}th percentile of {it:varname}. If {cmd:p()} is not
specified, 50 is assumed, meaning medians. {cmd:weights()} requests
weighted calculation of percentiles. {cmd:altdef} uses an alternative
formula for calculating percentiles, which is not applicable with
weights present. {cmd:by()} requests calculation by groups. You may
also use the {cmd:by:} construct. (Stata 8.2 required.)
{p 4 8 2}
{cmd:wtfreq(}{it:exp}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
creates a constant (within {it:byvarlist})
containing the weighted frequency using {it:exp} as weights. (Such
frequencies sum to {cmd:_N}.) (Stata 6 required.)
{p 4 8 2}
{cmd:xtile(}{it:varname}{cmd:)}
[
{cmd:,}
{cmdab:p:ercentiles(}{it:numlist}{cmd:)}
{cmdab:n:quantiles(}{it:#}{cmd:)}
{cmdab:w:eights(}{it:varname}{cmd:)}
{cmdab:alt:def}
{cmd:by(}{it:byvarlist}{cmd:)}
]
categorizes {it:varname} by specific percentiles. The function works
like {help xtile}. By default {it:varname} is dichotomized at the
median. {cmd:percentiles()} requests percentiles corresponding to
{it:numlist}: for example, {cmd:p(25(25)75)} is used to create a
variable according to quartiles. Alternatively you also may have
specified {cmd:n(4)}: to create a variable according to quartiles.
{cmd:weights()} requests weighted calculation of percentiles.
{cmd:altdef} uses an alternative formula for calculating percentiles.
See {help xtile}. {cmd:by()} requests calculation by groups. You may
also use the {cmd:by:} construct. (Stata 8.2 required.)
{p 4 8 2}{cmd:. egen mpg4 = xtile(mpg), by(foreign) p(25(25)75)}{p_end}
{p 4 8 2}{cmd:. egen mpg10 = xtile(mpg), by(foreign) nq(10)}
{title:First and last}
{p 4 8 2}
{cmd:first(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the first non-missing value of {it:varname}. `First' depends on the
existing order of observations. {it:varname} may be numeric or string.
(Stata 6 required.)
{p 4 8 2}
{cmd:ifirst(}{it:numvar}{cmd:)}
{cmd:,}
{cmdab:v:alue(}{it:#}{cmd:)}
[
{c -(}
{cmdab:be:fore}
{c |}
{cmdab:a:fter}
{c )-}
{cmd:by(}{it:byvarlist}{cmd:)}
]
indicates the first occurrence of integer {it:#} within {it:numvar}
by 1 and other observations by 0.
{p 8 8 2}
{cmd:before} indicates observations before the first occurrence by 1
and other observations by 0.
{cmd:after} indicates observations after the first occurrence by 1
and other observations by 0.
The default, the value {cmd:before} and the value {cmd:after}
always sum to 1 for observations analysed.
{p 8 8 2}
First occurrence is determined as follows: (1) if {cmd:if} or {cmd:in} is
specified, any observations excluded are ignored; (2) if {cmd:by()} is
specified, first is determined separately for each distinct group of
observations; (3) first is first in current sort order.
If {it:#} does not occur, all observations
are before the first occurrence. (Stata 6 required.)
{p 4 8 2}{cmd:. gen warm = celstemp > 20}{p_end}
{p 4 8 2}{cmd:. egen fwarm = ifirst(warm), v(1) by(year)}
{p 4 8 2}
{cmd:ilast(}{it:numvar}{cmd:)}
{cmd:,}
{cmdab:v:alue(}{it:#}{cmd:)}
[
{c -(}
{cmdab:be:fore}
{c |}
{cmdab:a:fter}
{c )-}
{cmd:by(}{it:byvarlist}{cmd:)}
]
indicates the last occurrence of integer {it:#} within {it:numvar} by 1 and
other observations by 0.
{p 8 8 2}
{cmd:before} indicates observations before the last occurrence by 1
and other observations by 0.
{cmd:after} indicates observations after the last occurrence by 1
and other observations by 0.
The default, the value {cmd:before} and the value {cmd:after}
always sum to 1 for observations analysed.
{p 8 8 2}
Last occurrence is determined as follows: (1) if {cmd:if} or {cmd:in} is
specified, any observations excluded are ignored; (2) if {cmd:by()} is
specified, last is determined separately for each distinct group of
observations; (3) last is last in current sort order.
If {it:#} does not occur, all
observations are before the last occurrence. (Stata 6 required.)
{p 4 8 2}
{cmd:lastnm(}{it:varname}{cmd:)}
[
{cmd:,}
{cmd:by(}{it:byvarlist}{cmd:)}
]
returns the last non-missing value of {it:varname}. `Last' depends on
the existing order of observations. {it:varname} may be numeric or
string. Remark: {cmd:lastnm()} would have been better called
{cmd:last()}, except that an {cmd:egen} program with that name for
selecting the last `word' in a string was published in STB-50.
(Stata 6 required.)
{title:Random numbers}
{p 4 8 2}
{cmd:mixnorm()}
[
{cmd:,}
{cmd:frac(}{it:#}{cmd:)}
{cmd:mu1(}{it:#}{cmd:)}
{cmd:mu2(}{it:#}{cmd:)}
{cmd:var1(}{it:#}{cmd:)}
{cmd:var2(}{it:#}{cmd:)}
]
generates a new variable of specified type as
a mixture of two Normal distributions, with the fraction
{cmd:frac(}{it:#}{cmd:)} of the observations defined by the first
distribution. Both options for means {cmd:mu1(}{it:#}{cmd:)} and
{cmd:mu2(}{it:#}{cmd:)} default to 0; both options for variances
{cmd:var1(}{it:#}{cmd:)} and {cmd:var2(}{it:#}{cmd:)} default to 1,
while {cmd:frac(}{it:#}{cmd:)} defaults to 0.5. Only non-default
parameters of the desired mixture need be specified. (Stata 8 required.)
{p 4 8 2}{cmd:. egen mixture = mixnorm(), frac(0.9) mu2(10) var2(4)}
{p 4 8 2}
{cmd:rndint()}
{cmd:,}
{cmdab:ma:x(}{it:#}{cmd:)}
[
{cmdab:mi:n(}{it:#}{cmd:)}
]
generates random integers from a uniform distribution on {cmd:min()} to
{cmd:max()}, inclusive. {cmd:min(1)} is the default.
Remark: Note that {cmd:ceil(uniform() * }{it:#}{cmd:)} is a direct way
to get random integers from 1 to {it:#}. (Stata 6 required.)
{p 4 8 2}{cmd:. egen integ = rndint(), min(100) max(199)}{p_end}
{p 4 8 2}
{cmd:rndsub()}
[
{cmd:,}
{cmdab:ng:roup(}{it:#}{cmd:)}
{c -(}
{cmdab:f:rac(}{it:#}{cmd:)}
{c |}
{cmdab:p:ercent(}{it:#}{cmd:)}
{c )-}
{cmd:by(}{it:byvarlist}{cmd:)}
]
randomly splits observations into groups or subsamples. The result is a
categorical variable taking values from 1 upward labelling distinct groups.
{p 8 8 2}
{cmd:ngroup(}{it:#}{cmd:)} (default 2) defines the number of groups.
{p 8 8 2}
{cmd:frac(}{it:#}{cmd:)}, which is only allowed with {cmd:ngroup(2)}, specifies that
the first group should contain 1 / {it:#} of the observations and thus that
the second group should contain the remaining observations.
{p 8 8 2}
{cmd:percent(}{it:#}{cmd:)}, which is only allowed with {cmd:ngroup(2)},
specifies that the first group should contain {it:#}% of the observations and thus that
the second group should contain the remaining observations.
{p 8 8 2}
{cmd:frac()} and {cmd:percent()} may not be specified together.
(Stata 6 required.)
{p 4 8 2}{cmd:. egen group = rndsub(), by(foreign)}{p_end}
{p 4 8 2}{cmd:. egen group = rndsub(), by(foreign) f(3)}{p_end}
{p 4 4 2}(first group contains 1/3 of observations, second group contains 2/3)
{p 4 8 2}{cmd:. egen group = rndsub(), by(foreign) p(25)}{p_end}
{p 4 8 2}(first group contains 25% of observations, second group contains 75%)
{p 4 4 2}
For reproducible results, set the seed of the random number generator
beforehand and document your choice.
{p 4 4 2}
Note that to generate {it:#} random numbers the number of observations must be
at least {it:#}. If there are no data in memory and you want 100 random
numbers, type {cmd:set obs 100} before using these functions.
{title:Row operations}
{p 4 8 2}
{cmd:rall(}{it:varlist}{cmd:)}
{cmd:,}
{cmdab:c:ond(}{it:condition}{cmd:)}
[
{cmdab:sy:mbol(}{it:symbol}{cmd:)}
]
returns 1 for observations for which the condition specified is true for
all variables in {it:varlist} and 0 otherwise. The condition should be
specified using {cmd:symbol()}, by default {cmd:@}, as a placeholder for each
variable. Thus, for example,
{cmd:rall(}{it:varlist}{cmd:), c(@ > 0 & @ < .)}
tests whether all variables in {it:varlist} are positive and
non-missing. Note that conditions typically make sense only if variables
are either all numeric or all string: one exception is {cmd:missing(@)}.
(Stata 6 required.)
{p 4 8 2}
{cmd:rany(}{it:varlist}{cmd:)}
{cmd:,}
{cmdab:c:ond(}{it:condition}{cmd:)}
[
{cmdab:sy:mbol(}{it:symbol}{cmd:)}
]
returns 1 for observations for which the condition specified is true for
any variable in {it:varlist} and 0 otherwise. The condition should be
specified using {cmd:symbol()}, by default {cmd:@}, as a placeholder for each
variable. Thus, for example, {cmd:rany(}{it:varlist}{cmd:), c(@ > 0 & @ < .)}
tests whether any variable in {it:varlist} is positive and non-missing.
Note that conditions typically make sense only if variables are either
all numeric or all string: one exception is {cmd:missing(@)}.
(Stata 6 required.)
{p 4 8 2}
{cmd:rcount(}{it:varlist}{cmd:)}
{cmd:,}
{cmdab:c:ond(}{it:condition}{cmd:)}
[
{cmdab:sy:mbol(}{it:symbol}{cmd:)}
]
returns the number of variables in {it:varlist} for which the condition
specified is true. The condition should be specified using {cmd:symbol()}, by
default {cmd:@}, as a placeholder for each variable. Thus, for example,
{cmd:rcount(}{it:varlist}{cmd:), c(@ > 0 & @ < .)} counts for each observation how
many variables in {it:varlist} are positive and non-missing. Note that
conditions typically make sense only if variables are either all numeric or all
string: one exception is {cmd:missing(@)}. More precisely, {cmd:rcount()}
gives the sum across {it:varlist} of condition, evaluated in turn for each
variable. (Stata 6 required.)
{p 4 4 2}
For {cmd:rall()}, {cmd:rany()}, and {cmd:rcount()}, the {cmd:symbol()} option
may be used to set an alternative to {cmd:@} whenever the latter is
inappropriate. For example, if string variables were being searched for literal
occurrences of {cmd:"@"}, some other symbol not appearing in text or in
variable names should be used.
{p 4 8 2}{cmd:. egen any = rany(b c d e f) , c(@ == a)}{p_end}
{p 4 8 2}{cmd:. egen all = rall(b c d e f) , c(@ == a)}{p_end}
{p 4 8 2}{cmd:. egen count = rcount(b c d e f) , c(@ == a)}{p_end}
{p 4 4 2}(values of {cmd:b c d e f} matched by (equal to) those of {cmd:a}?)
{p 4 8 2}{cmd:. egen anyw1 = rany(b c d e f) , c(abs(@ - a) <= 1)}{p_end}
{p 4 8 2}{cmd:. egen allw1 = rall(b c d e f) , c(abs(@ - a) <= 1)}{p_end}
{p 4 8 2}{cmd:. egen countw1 = rcount(b c d e f) , c(abs(@ - a) <= 1)}{p_end}
{p 4 4 2}(values of {cmd:b c d e f} within 1 of those of {cmd:a}?)
{p 4 4 2}
From Stata 7, {help foreach} provides an alternative that would now be
considered better style:
{p 4 8 2}{cmd:. gen any = 0}{p_end}
{p 4 8 2}{cmd:. gen all = 1}{p_end}
{p 4 8 2}{cmd:. gen count = 0}{p_end}
{p 4 8 2}{cmd:. foreach v of var a b c d e f {c -(}}{p_end}
{p 4 8 2}{cmd:. {space 8}replace any = max(any, inrange(`v', 0, .))}{p_end}
{p 4 8 2}{cmd:. {space 8}replace all = min(all, inrange(`v', 0, .))}{p_end}
{p 4 8 2}{cmd:. {space 8}replace count = count + inrange(`v', 0, .)}{p_end}
{p 4 8 2}{cmd:. {c )-}}{p_end}
{p 4 8 2}
{cmd:rowmedian(}{it:varlist}{cmd:)}
returns the median across observations of the variables in {it:varlist}.
(Stata 9 required.) (Note: official Stata added a {cmd:rowmedian()}
function in Stata 11, which always trumps this one.)
{p 4 8 2}
{cmd:rownvals(}{it:numvarlist}{cmd:)} [ {cmd:,} {cmdab:miss:ing} ]
returns the number of distinct values in each observation for a set of
numeric variables {it:numvarlist}. Thus if the values in one observation for
five numeric variables are 1, 1, 2, 2, 3 the function returns 3 for
that observation. Missing values, i.e. any of . .a ... .z, are ignored
unless the {cmd:missing} option is specified. (Stata 9 required.)
{p 4 8 2}
{cmd:rowsvals(}{it:strvarlist}{cmd:)} [ {cmd:,} {cmdab:miss:ing} ]
returns the number of distinct values in each observation for a set of
string variables {it:strvarlist}. Thus if the values in one observation for
five string variables are "frog", "frog", "toad", "toad", "newt" the function returns 3 for
that observation. Missing values, i.e. empty strings "", are ignored
unless the {cmd:missing} option is specified. (Stata 9 required.)
{p 4 8 2}
{cmd:rsum2(}{it:varlist}{cmd:)} is a generalisation of {help egen}'s
{cmd:rsum()} (from Stata 9: {cmd:rowtotal()}) function with the extra
options {cmdab:allm:iss} and {cmdab:anym:iss}.
As with {cmd:rsum()}, it creates the (row) sum of the variables in {it:varlist},
treating missing as 0. However, if the option {cmd:allmiss} is selected, the
(row) sum for any observation for which all variables in {it:varlist} are
missing is set equal to missing. Similarly, if the option {cmd:anymiss} is
selected the (row) sum for any observation for which any variable in
{it:varlist} is missing is set equal to missing. (Stata 6 required.)
{title:References}
{p 4 8 2}
Bross, I.D.J. 1958. How to use ridit analysis. {it:Biometrics} 14: 18{c -}38.
{p 4 8 2}
Cox, N.J. 2008. Speaking Stata: Between tables and graphs.
{it:Stata Journal} 8(2): 269{c -}289.
{p 4 8 2}
Cox, N.J. and G. M. Longton. 2008.
Speaking Stata: Distinct observations.
{it:Stata Journal} 8(4): 557{c -}568.
{p 4 8 2}
Fleiss, J.L. 1981. {it:Statistical Methods for Rates and Proportions.}
New York: John Wiley.
{p 4 8 2}
Flora, J.D. 1988. Ridit analysis. In Kotz, S. and Johnson, N.L. (eds)
{it:Encyclopedia of Statistical Sciences.} New York: John Wiley. 8: 136{c -}139.
{p 4 8 2}
Tukey, J.W. 1977. {it:Exploratory Data Analysis.} Reading, MA: Addison-Wesley.
{title:Maintainer}
{p 4 4 2}Nicholas J. Cox, Durham University, U.K.{break}
n.j.cox@durham.ac.uk
{title:Acknowledgements}
{p 4 4 2}
Kit Baum (baum@bc.edu) is the first author of {cmd:record()} and the
author of {cmd:dhms()}, {cmd:elap()}, {cmd:elap2()}, {cmd:hms()},
{cmd:tod()}, {cmd:mixnorm()} and {cmd:truncdig()}.
{p 4 4 2}
Ulrich Kohler (kohler@wzb.eu) is the author of {cmd:xtile()},
{cmd:mlabvpos()}, {cmd:iso3166()} and {cmd:wpctile()}.
{p 4 4 2}
Pablo A. Mitnik (pmitnik@stanford.edu) is the author of {cmd:d2()}.
{p 4 4 2}
Steven Stillman (s.stillman@verizon.net) is the author of {cmd:rsum2()}.
{p 4 4 2}
Nick Winter (njw3x@virginia.edu) is the author of {cmd:corr()} and
{cmd:noccur()}.
{p 4 4 2}
Kit Baum, Sascha Becker, Ron{c a'}n Conroy, William Gould, Syed Islam,
Ariel Linden,
John Moran,
Stephen Soldz, Richard Williams, Fred Wolfe and Gerald Wright
provided stimulating and helpful comments.
{title:Also see}
{p 4 13 2}STB: STB-50 dm70 for {cmd:atan2()}, {cmd:pp()}, {cmd:rev()}, {cmd:rindex()}, {cmd:rmed()}, {cmd:rotate()}
{p 4 13 2}Manual: [D] egen (before Stata 9 [R] egen)
{p 4 13 2}On-line: help for
{help egen},
{help dates},
{help functions},
{help means},
{help numlist},
{help seed},
{help tsset},
{help varlist} (timeseries operators),
{help circular} (if installed),
{help ntimeofday} (if installed),
{help stimeofday} (if installed)