{smcl}
{* *! version 1.4.2 30Jan2020}{...}
{viewerdialog gegen "dialog gegen"}{...}
{vieweralsosee "[R] gegen" "mansection R gegen"}{...}
{viewerjumpto "Syntax" "gegen##syntax"}{...}
{viewerjumpto "Description" "gegen##description"}{...}
{viewerjumpto "Options" "gegen##options"}{...}
{viewerjumpto "Stored results" "gegen##results"}{...}
{title:Title}

{p2colset 5 18 23 2}{...}
{p2col :{cmd:gegen} {hline 2}}Efficient implementation of by-able egen functions using C.{p_end}
{p2colreset}{...}

{pstd}
{it:Important}: Please run {stata gtools, upgrade} to update {cmd:gtools} to
the latest stable version.

{marker syntax}{...}
{title:Syntax}

{p 8 14 2}
{cmd:gegen} {dtype} {newvar} {cmd:=} {it:fcn}({it:arguments}) {ifin}
[{it:{help gegen##weight:weight}}]
[{cmd:,} {opt replace} {it:fcn_options} {help gegen##gtools_options:gtools_options}]

{synoptset 21 tabbed}{...}
{marker gtools_options}{...}
{synopthdr}
{synoptline}
{syntab:Gtools}
{synopt :{opt compress}}Try to compress strL to str#.
{p_end}
{synopt :{opt forcestrl}}Skip binary variable check and force gtools to read strL variables.
{p_end}
{synopt :{opt v:erbose}}Print info during function execution.
{p_end}
{synopt :{opt bench:mark}}Benchmark various steps of the plugin.
{p_end}
{synopt :{opt bench:marklevel(int)}}Benchmark various steps of the plugin.
{p_end}
{synopt :{opth hash:method(str)}}Hash method (default, biject, or spooky). Intended for debugging.
{p_end}
{synopt :{opth oncollision(str)}}Collision handling (fallback or error). Intended for debugging.
{p_end}
{synopt :{opth gtools_capture(str)}}The above options are captured and not passed to {opt egen} in case the requested function is not internally supported by gtools. You can pass extra arguments here if their names conflict with captured gtools options.
{p_end}
{synoptline}

{marker weight}{...}
{p 4 6 2}
{opt aweight}s, {opt fweight}s, {opt iweight}s, and {opt pweight}s are
allowed for the functions listed below and mimic {cmd:collapse} and
{cmd:gcollapse}; see {help weight} and
{help collapse##weights:Weights (collapse)}. {opt pweight}s may not be used
with {opt sd}, {opt variance}, {opt cv}, {opt semean}, {opt sebinomial}, or
{opt sepoisson}. {opt iweight}s may not be used with {opt semean},
{opt sebinomial}, or {opt sepoisson}. {opt aweight}s may not be used with
{opt sebinomial} or {opt sepoisson}.{p_end}

{pstd}
The following are simply wrappers for other {it:gtools} functions. They all
allow {opth by(varlist)} as an option. Consult each command's corresponding
help file for details. (Note that {cmd:gstats transform} in particular allows
options to be embedded in the statistic call instead of being passed as
program arguments; while this is technically also possible through
{cmd:gegen}, I do not recommend it. Instead, use {opt window()} with
{it:moving_stat}, {opt interval()} with {it:range_stat}, {opt cumby()} with
{it:cumsum}, and so on.) In the table, {it:stat} can be replaced with any stat
available to {cmd:gcollapse} except {it:percent} and {it:nunique}:

        {opt function}             -> {opt calls}
        {hline 40}
        {opth xtile(exp)}           -> {help fasterxtile}
        {opth standardize(varname)} -> {help gstats transform}
        {opth normalize(varname)}   -> {help gstats transform}
        {opth demean(varname)}      -> {help gstats transform}
        {opth demedian(varname)}    -> {help gstats transform}
        {opth moving_stat(varname)} -> {help gstats transform}
        {opth range_stat(varname)}  -> {help gstats transform}
        {opth cumsum(varname)}      -> {help gstats transform}
        {opth shift(varname)}       -> {help gstats transform}
        {opth rank(varname)}        -> {help gstats transform}
        {opth winsor(varname)}      -> {help gstats winsor}
        {opth winsorize(varname)}   -> {help gstats winsor}

{pstd}
The functions listed below have been compiled and hence will run very quickly.
Functions not listed here hash the data and then call {opt egen} with
{opth by(varlist)} set to the hash, which is often faster than calling
{opt egen} directly, but not always. Natively supported functions should
always be faster, however. They are:

{phang2}
{opth group(varlist)} [{cmd:,} {opt m:issing} {opth counts(newvarname)} {opth fill(real)}]{p_end}
{pmore2}
may not be combined with {cmd:by}. It creates one variable taking on values
1, 2, ... for the groups formed by {it:varlist}. {it:varlist} may contain
numeric variables, string variables, or a combination of the two. The default
order of the groups is the sort order of the {it:varlist}. However, the user
can specify:

{pmore3}
[{cmd:+}|{cmd:-}] {varname} [[{cmd:+}|{cmd:-}] {varname} {it:...}]

{pmore2}
The order is then inverted for variables that have {cmd:-} prepended.
{opt missing} indicates that missing values in {it:varlist}
{bind:(either {cmd:.} or {cmd:""})} are to be treated like any other value
when assigning groups, instead of the new variable being set to missing for
those observations.

{pmore2}
You can also specify {opt counts()} to generate a new variable with the number
of observations per group; by default all observations within a group are
filled with the count, but via {opt fill()} the user can specify the value the
variable will take after the first observation that appears within a group.
The user can also specify {opt fill(data)} to fill the first {it:J}
observations with the per-group counts, one per group in the sorted group
order, or {opt fill(group)} to keep the default behavior.

{phang2}
{opth tag(varlist)} [{cmd:,} {opt m:issing}]{p_end}
{pmore2}
may not be combined with {cmd:by}. It tags just 1 observation in each distinct
group defined by {it:varlist}. When all observations in a group have the same
value for a summary variable calculated for the group, it will be sufficient
to use just one value for many purposes.
The result is never missing: it is 1 if the observation is tagged and 0
otherwise.

{pmore2}
Note that values for any observations excluded by either {helpb if} or
{helpb in} are set to 0 (not missing). Hence, if {opt tag} is the variable
produced by {cmd:gegen tag =} {opt tag(varlist)}, the idiom {opt if tag} is
always safe. {opt missing} specifies that missing values of {it:varlist} may
be included.

        {opth first|last|firstnm|lastnm(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the first, last, first
non-missing, or last non-missing observation of {it:exp}, respectively. These
functions are analogous to those in {opt collapse}, {opt not} to those in
{opt egenmore}.

        {opth count(exp)} {right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the number of nonmissing
observations of {it:exp}.

        {opth nunique(exp)} {right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the number of unique
values of {it:exp}.

        {opth iqr(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the interquartile range of
{it:exp}. Also see {help gegen##pctile():{bf:pctile()}}.

        {opth max(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the maximum value of
{it:exp}.

{marker mean()}{...}
        {opth mean(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the mean of {it:exp}.

{marker geomean()}{...}
        {opth geomean(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the geometric mean of
{it:exp}. If {it:exp} has negative values, the function returns missing (.).
If {it:exp} has any zeros, the function returns zero.
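
{pmore2}
As a rough sketch of the entries above (assuming {cmd:gtools} is installed and
using Stata's bundled {it:auto} data; the new variable names are only
illustrative and not part of {cmd:gegen}), group-level counts, means, and
geometric means could be generated as follows. The last call relies on the
statement above that a zero in {it:exp} yields a geometric mean of zero.

```stata
* Sketch only: assumes gtools is installed (e.g. via -ssc install gtools-).
* All new variable names below are illustrative.
sysuse auto, clear

* Number of non-missing rep78 values and mean price within each foreign group
gegen n_rep   = count(rep78), by(foreign)
gegen m_price = mean(price),  by(foreign)

* Geometric mean of a strictly positive variable
gegen gm_price = geomean(price), by(foreign)

* Copy of price with a single zero placed in the domestic group
generate double p0 = cond(make == "AMC Concord", 0, price)
gegen gm0 = geomean(p0), by(foreign)

* One row per group is enough to inspect the constants
gegen byte pick = tag(foreign)
list foreign n_rep m_price gm_price gm0 if pick
```

{pmore2}
Because each result is constant within {it:foreign}, listing only the tagged
observations shows one row per group.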
{marker median()}{...}
        {opth median(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the median of {it:exp}.
Also see {help gegen##pctile():{bf:pctile()}}.

        {opth min(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the minimum value of
{it:exp}.

        {opth range(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the value range of
{it:exp}.

{marker select()}{...}
        {opth select(exp)} {cmd:, n(}{it:#}|{it:-#}{cmd:)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the {it:#}th smallest
value of {it:exp}. To compute the {it:#}th largest value, prefix a negative
sign, {it:-#}. Note that without weights, {opt n(1)} and {opt n(-1)} will give
the same value as {opt min} and {opt max}, respectively.

{marker pctile()}{...}
        {opth pctile(exp)} [{cmd:, p(}{it:#}{cmd:)}]{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the {it:#}th percentile of
{it:exp}. If {opt p(#)} is not specified, 50 is assumed, meaning medians.
Also see {help gegen##median():{bf:median()}}.

        {opth sd(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the standard deviation of
{it:exp}. Also see {help gegen##mean():{bf:mean()}}.

        {opth variance(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the variance of {it:exp}.
Also see {help gegen##sd():{bf:sd()}}.

        {opth cv(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the coefficient of
variation of {it:exp}; {opt sd/mean}. Also see {help gegen##sd():{bf:sd()}}
and {help gegen##mean():{bf:mean()}}.
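
{pmore2}
The relationships stated in the entries above (unweighted {opt n(1)} and
{opt n(-1)} matching {opt min} and {opt max}, {opt pctile} defaulting to the
median, and {opt cv} being {opt sd/mean}) can be checked with a short sketch.
It assumes {cmd:gtools} is installed and uses Stata's bundled {it:auto} data;
the variable names are only illustrative.

```stata
* Sketch only: assumes gtools is installed; variable names are illustrative.
sysuse auto, clear

* Smallest and largest price per group, computed two ways
gegen p_min = min(price),    by(foreign)
gegen p_max = max(price),    by(foreign)
gegen p_lo  = select(price), by(foreign) n(1)
gegen p_hi  = select(price), by(foreign) n(-1)
gegen p_rng = range(price),  by(foreign)

* Percentiles: p() defaults to 50, i.e. the median
gegen double p_med = median(price), by(foreign)
gegen double p_p50 = pctile(price), by(foreign)
gegen double p_p90 = pctile(price), by(foreign) p(90)

* Coefficient of variation next to its two ingredients
gegen double p_mean = mean(price), by(foreign)
gegen double p_sd   = sd(price),   by(foreign)
gegen double p_cv   = cv(price),   by(foreign)

* These should hold without weights, per the descriptions above
assert p_lo == p_min
assert p_hi == p_max
assert reldif(p_med, p_p50) < 1e-12
assert reldif(p_cv, p_sd / p_mean) < 1e-8
```

{pmore2}
Requesting {dtype} {cmd:double} explicitly keeps the floating-point
comparisons from depending on the default storage type.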
        {opth percent(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the percent of non-missing
observations of {it:exp} in the group relative to the sample.

        {opth semean(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the standard error of the
mean of {it:exp} (sd/sqrt(n)).

        {opth sebinomial(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the binomial standard
error of the mean of {it:exp} (sqrt(p(1-p)/n)); missing if {it:exp} takes
values other than 0 and 1.

        {opth sepoisson(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the Poisson standard error
of the mean of {it:exp} (sqrt(mean / n)); missing if {it:exp} is negative;
result rounded to nearest integer.

        {opth skewness(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the skewness of {it:exp}.

        {opth kurtosis(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the kurtosis of {it:exp}.

        {opth sum(exp)} [{cmd:,} {opt m:issing}] {right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
        {opth total(exp)} [{cmd:,} {opt m:issing}] {right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the sum of {it:exp}
treating missing as 0. If {opt missing} is specified and all values in
{it:exp} are missing, {it:newvar} is set to missing. Also see
{help gegen##mean():{bf:mean()}}.
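
{pmore2}
To make the {opt missing} option of {opt sum()} and the formula for
{opt semean()} concrete, here is a small sketch on made-up data. It assumes
{cmd:gtools} is installed; the data and all variable names are purely
illustrative.

```stata
* Sketch only: assumes gtools is installed; data and names are illustrative.
clear
input byte g double x
1 1
1 .
2 .
2 .
3 2
3 4
3 6
end

* Default: missing values are treated as 0, so group 2 sums to 0
gegen s_default = sum(x), by(g)

* With -missing-, a group whose values are all missing gets a missing sum
gegen s_missing = sum(x), by(g) missing

* Standard error of the mean next to sd and the non-missing count
gegen double se_x = semean(x), by(g)
gegen double sd_x = sd(x),     by(g)
gegen        n_x  = count(x),  by(g)

list, sepby(g)
```

{pmore2}
In group 3, which has three non-missing values, {it:se_x} should agree with
{it:sd_x}/sqrt({it:n_x}) up to floating-point error.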
        {opth gini(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
        {opth gini|dropneg(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
        {opth gini|keepneg(exp)}{right:(allows {help by:{bf:by} {it:varlist}{bf::}}) }
{pmore2}
creates a constant (within {it:varlist}) containing the Gini coefficient of
{it:exp}, truncating negative values to 0. {opt gini|dropneg} drops negative
values, and {opt gini|keepneg} keeps negative values as is (the user is
responsible for the interpretation of the Gini coefficient in this case).

{marker description}{...}
{title:Description}

{pstd}
{cmd:gegen} creates {newvar} of the optionally specified storage type equal to
{it:fcn}{cmd:(}{it:arguments}{cmd:)}. Here {it:fcn}{cmd:()} is either one of
the internally supported functions above or a by-able function written for
{cmd:egen}. Only {cmd:egen} functions or internally supported functions may be
used with {cmd:gegen}. If you want to generate multiple summary statistics
from a single variable it may be faster to use {opt gcollapse} with the
{opt merge} option.

{pstd}
Depending on {it:fcn}{cmd:()}, {it:arguments}, if present, refers to an
expression, {varlist}, or a {it:{help numlist}}, and the {it:options} are
similarly {it:fcn} dependent.

{marker memory}{...}
{title:Out of memory}

{pstd}
(See also Stata's own discussion: {help memory:help memory}.)

{pstd}
There are many reasons why an OS may run out of memory. The best-case scenario
is that your system is running some other memory-intensive program. This is
especially likely if you are running your program on a server, where memory is
shared across all users. In this case, you should attempt to re-run
{it:gegen} once other memory-intensive programs finish.

{pstd}
If no memory-intensive programs were running concurrently, the second
best-case scenario is that your user has a cap on the memory your programs can
use. Again, this is especially likely on a server, and even more likely on a
computing grid.
If you are on a grid, see if you can increase the amount of memory your
programs can use (there is typically a setting for this). If your cap was set
by a system administrator, consider contacting them and asking for a higher
memory cap.

{pstd}
If you have no memory cap imposed on your user, the likely scenario is that
your system cannot allocate enough memory for {it:gegen}. At this point you
have two options. One option is to try {it:fegen} or {it:egen}, which are
slower, but switching to either should require only a trivial one-letter
change to the code. Note, however, that replacing {it:gegen} with {it:fegen}
or plain {it:egen} is not guaranteed to use less memory. I have not
benchmarked memory use very extensively, so {it:gegen} might use less memory
(I doubt that is the case in most scenarios, but it is possible).

{pstd}
The other option is to process the data in segments (the easiest way to do
this would be to run the function on a portion of the rows at a time and
perform a series of append statements at the end). However, if you are doing
group operations you would need to first sort the data and make sure you are
not splitting groups apart.

{marker example}{...}
{title:Examples}

{pstd}
See the
{browse "http://gtools.readthedocs.io/en/latest/usage/gegen/index.html#examples":online documentation}
for examples.
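
{pstd}
A few minimal sketches of typical calls are also given here. They assume
{cmd:gtools} is installed and use Stata's bundled {it:auto} data; the new
variable names are only illustrative. They cover {opt group()} and
{opt tag()}, a statistic restricted with {it:if}, and the {cmd:gcollapse}
alternative with {opt merge} mentioned in the description.

```stata
* Sketch only: assumes gtools is installed (e.g. via -ssc install gtools-).
* All new variable names below are illustrative.
sysuse auto, clear

* Number the groups formed by foreign and rep78; -missing- keeps
* observations with missing rep78 as groups of their own
gegen id = group(foreign rep78), missing

* Tag exactly one observation in each of those groups
gegen byte one = tag(foreign rep78), missing

* A minus sign inverts the sort order for that variable
gegen id_desc = group(-foreign rep78), missing

* Group mean computed only over observations that satisfy the condition
gegen avg_mpg = mean(mpg) if price < 10000, by(foreign)

* Several statistics of one variable at once, merged back onto the data
gcollapse (mean) price_mean = price (sd) price_sd = price, by(foreign) merge

* The number of tagged observations equals the number of groups
quietly summarize id
local J = r(max)
quietly count if one == 1
assert r(N) == `J'
```

{pstd}
If memory is the constraint, the same calls can be tried with {cmd:egen} in
place of {cmd:gegen} for functions that {cmd:egen} itself supports, subject to
the caveats discussed under {help gegen##memory:Out of memory}.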
{marker author}{...}
{title:Author}

{pstd}Mauricio Caceres Bravo{p_end}
{pstd}{browse "mailto:mauricio.caceres.bravo@gmail.com":mauricio.caceres.bravo@gmail.com}{p_end}
{pstd}{browse "https://mcaceresb.github.io":mcaceresb.github.io}{p_end}

{title:Website}

{pstd}{cmd:gegen} is maintained as part of {manhelp gtools R:gtools} at {browse "https://github.com/mcaceresb/stata-gtools":github.com/mcaceresb/stata-gtools}{p_end}

{marker acknowledgment}{...}
{title:Acknowledgment}

{pstd}
This help file was based on StataCorp's own help file for {it:egen}.
{p_end}

{pstd}
This project was largely inspired by Sergio Correia's {it:ftools}:
{browse "https://github.com/sergiocorreia/ftools"}.
{p_end}

{pstd}
The OSX version of gtools was implemented with invaluable help from @fbelotti;
see {browse "https://github.com/mcaceresb/stata-gtools/issues/11"}.
{p_end}

{title:Also see}

{p 4 13 2}
help for
{help gcollapse},
{help gtools};
{help fegen} (if installed),
{help fcollapse} (if installed),
{help ftools} (if installed)
{p_end}