{smcl} {* *! version 0.2.1 30Jan2020}{...} {viewerdialog gstats_summarize "dialog gstats_summarize"}{...} {vieweralsosee "[R] gstats_summarize" "mansection R gstats_summarize"}{...} {viewerjumpto "Syntax" "gstats_summarize##syntax"}{...} {viewerjumpto "Description" "gstats_summarize##description"}{...} {viewerjumpto "Statistics" "gstats_summarize##statistics"}{...} {title:Title} {p2colset 5 25 28 2}{...} {p2col :{cmd:gstats summarize} {hline 2}} Summary statistics by group using C for speed {p_end} {p2colreset}{...} {pstd} {it:Important}: Please run {stata gtools, upgrade} to update {cmd:gtools} to the latest stable version. {marker syntax}{...} {title:Syntax} {p 8 17 2} {cmd:gstats {ul:sum}marize} {varlist} {ifin} [{it:{help gstats summarize##weight:weight}}] [{cmd:,} {opth by(varlist)} {it:{help gstats summarize##table_options:options}}] {p 8 17 2} {cmd:gstats {ul:tab}stat} {varlist} {ifin} [{it:{help gstats summarize##weight:weight}}] [{cmd:,} {opth by(varlist)} {it:{help gstats summarize##table_options:options}}] {pstd} {cmd:gstats {ul:tab}stat} and {cmd:gstats {ul:sum}marize} are fast, by-able alternatives to {opt tabstat} and {opt summarize, detail}. If {cmd:gstats summarize} is called with {opt by()} or {opt tab}, a table in the style of {opt tabstat} is produced that inclues all the summary statistics included by default in {opt summarize, detail}. {pstd} Note the {it:prefixes} {cmd:by}, {cmd:rolling}, {cmd:statsby} are {cmd:{it:not}} supported. To compute a table of statistics by a group use the option {opt by()}. With {opt by()}, {opt gstats tab} is also faster than {cmd:gcollapse}. {synoptset 23 tabbed}{...} {marker table_options}{...} {synopthdr} {synoptline} {syntab :Tabstat Options} {synopt:{opth by(varlist)}}Group statistics by variable. {p_end} {synopt:{cmdab:s:tatistics:(}{it:{help gstats_summarize##statname:stat}} [{it:...}]{cmd:)}}Report specified statistics; default for {opt tabstat} is count, sum, mean, sd, min, max. {p_end} {synopt:{opt col:umns(stat|var)}}Columns are statistics (default) or variables. {p_end} {synopt:{opt pretty:stats}}Pretty statistic header names {p_end} {synopt:{opth labelw:idth(int)}}Max by variable label/value width. {p_end} {synopt:{opt f:ormat}[{cmd:(%}{it:{help format:fmt}}{cmd:)}]} Use format to display summary stats; default %9.0g {p_end} {syntab :Summarize Options} {synopt:{opt nod:etail}}Do not display the full set of statistics. {p_end} {synopt:{opt mean:only}}Calculate only the count, sum, mean, min, max. {p_end} {synopt:{opth by(varlist)}}Group by variable; all stats are computed but output is in the style of tabstat. {p_end} {synopt:{opt sep:arator(#)}}Draw separator line after every {it:#} variables; default is {cmd:separator(5)}. {p_end} {synopt:{opt tab:stat}}Compute and display statistics in the style of {opt tabstat}. {p_end} {syntab :Common Options} {synopt:{opt mata:save}[{cmd:(}{it:str}{cmd:)}]}Save results in mata object (default name is {bf:GstatsOutput}) {p_end} {synopt:{opt pool:ed}}Pool varlist {p_end} {synopt:{opt noprint}}Do not print {p_end} {synopt:{opt f:ormat}}Use variable's display format. {p_end} {synopt:{opt nomiss:ing}}With {opt by()}, ignore groups with missing entries. {p_end} {syntab:Gtools Options} {synopt :{opt compress}}Try to compress strL to str#. {p_end} {synopt :{opt forcestrl}}Skip binary variable check and force gtools to read strL variables. {p_end} {synopt :{opt v:erbose}}Print info during function execution. {p_end} {synopt :{opt bench}{it:[(int)]}}Benchmark various steps of the plugin. Optionally specify depth level. {p_end} {synopt :{opth hash:method(str)}}Hash method (default, biject, or spooky). Intended for debugging. {p_end} {synopt :{opth oncollision(str)}}Collision handling (fallback or error). Intended for debugging. {p_end} {synoptline} {p2colreset}{...} {p 4 6 2} {marker weight}{...} {p 4 6 2} {opt aweight}s, {opt fweight}s, {opt iweight}s, and {opt pweight}s are allowed (see {manhelp weight U:11.1.6 weight} for more on the way Stata uses weights). {marker description}{...} {title:Description} {pstd} {opt gstats tab} and {opt gstats sum} are mainly designed to report statistics by group. It does not modify the data in memory, so it is a nice alternative to {opt gcollapse} when there are few groups and you want to compute summary stats more quickly. {pstd} {opt gstats sum} by default computes the staistics that are reported by {opt sum, detail} and without {opt by()} it is anywhere from 5 to 40 times faster. The lower end of the speed gains are for Stata/MP, but {opt sum, detail} is very slow in versions of Stata that are not multi-threaded. The behavior of plain {opt summarize} and {opt summarize, meanonly} can be recovered via options {opt nodetail} and {opt meanonly}, but Stata is not specially slow in this case. Hence they are mainly included for use with {opt by()}, where {opt gstats sum} is again faster. {pstd} {opt gstats tab} should be faster than {opt tabstat} even without groups, but the speed gains are largest with even a modest number of levels in {opt by()}. Furthermore, an arbitrary number of grouping variables are allowed. Note that with a very large numer of groups, {opt tabstat}'s runtime seems to scale non-linearly, while {opt gstats tab} will execute in a reasonable time. {pstd} {opt gstata tab} does not store results in {opt r()}. Rather, the option {opt matasave} is provided to store the full set of summary statistics and the by variable levels in a mata class object called {opt statsOutput} (the name of the object can be changed via {opt matasave(name)}). Run {opt mata GstatsOutput.desc()} after {opt gstats tab, matasave} for details. The following helper functions are provided: string scalar getf(j, l, maxlbl) get formatted (j, l) entry from by variables up to maxlbl characters real matrix getnum(j, l) get (j, l) numeric entry from by variables string matrix getchar(j, l,| raw) get (j, l) numeric entry from by variables; raw controls whether to null-pad entries real rowvector getOutputRow(j) get jth output row real colvector getOutputCol(j) get jth output column by position real matrix getOutputVar(var) get jth output var by name real matrix getOutputGroup(j) get jth output group {pstd} The following data is stored {opt GstatsOutput}: summary statistics ------------------ real matrix output matrix with output statistics; J x kstats x kvars real scalar colvar 1: columns are variables, rows are statistics; 0: the converse real scalar ksources number of variable sources (0 if pool is true) real scalar kstats number of statistics real matrix tabstat 1: used tabstat; 0: used summarize string rowvector statvars variables summarized string rowvector statnames statistics computed real rowvector scodes internal code for summary statistics real scalar pool pooled source variables variable levels (empty if without -by()-) ----------------------------------------- real scalar anyvars 1: any by variables; 0: no by variables real scalar anynum 1: any numeric by variables; 0: all string by variables real scalar anychar 1: any string by variables; 0: all numeric by variables string rowvector byvars by variable names real scalar kby number of by variables real scalar rowbytes number of bytes in one row of the internal by variable matrix real scalar J number of levels real matrix numx numeric by variables string matrix charx string by variables real scalar knum number of numeric by variables real scalar kchar number of string by variables real rowvector lens > 0: length of string by variables; <= 0: internal code for numeric variables real rowvector map map from index to numx and charx printing options ---------------- void printOutput() print summary table real scalar maxlbl max by variable label/value width real scalar pretty print pretty statistic names real scalar usevfmt use variable format for printing string scalar dfmt fallback printing format real scalar maxl maximum column length void readDefaults() reset printing defaults {marker statistics}{...} {title:Statistics} {phang} {cmd:statistics(}{it:statname} [{it:...}]{cmd:)} specifies the statistics to be displayed; the default with {opt tabstat} is equivalent to specifying {cmd:statistics(mean)}. ({opt stats()} is a synonym for {opt statistics()}.) Multiple statistics may be specified and are separated by white space, such as {cmd:statistics(mean sd)}. Available statistics are {marker statname}{...} {synoptset 17}{...} {synopt:{space 4}{it:statname}}Definition{p_end} {space 4}{synoptline} {synopt:{space 4}{opt me:an}} mean{p_end} {synopt:{space 4}{opt geomean}}geometric mean (missing if var has any negative values){p_end} {synopt:{space 4}{opt co:unt}} count of nonmissing observations{p_end} {synopt:{space 4}{opt n}} same as {cmd:count}{p_end} {synopt:{space 4}{opt nmiss:ing}} number of missing observations{p_end} {synopt:{space 4}{opt perc:ent}} percentage of nonmissing observations{p_end} {synopt:{space 4}{opt nuniq:ue}} number of unique elements{p_end} {synopt:{space 4}{opt su:m}} sum{p_end} {synopt:{space 4}{opt rawsu:m}} sum, ignoring optionally specified weights ({bf:note}: zero-weighted obs are still excluded){p_end} {synopt:{space 4}{opt nansu:m}} sum; returns . instead of 0 if all entries are missing{p_end} {synopt:{space 4}{opt rawnansu:m}} rawsum; returns . instead of 0 if all entries are missing{p_end} {synopt:{space 4}{opt med:ian}} median (same as {opt p50}){p_end} {synopt:{space 4}{opt p#.#}} arbitrary quantiles{p_end} {synopt:{space 4}{opt p1}} 1st percentile{p_end} {synopt:{space 4}{opt p2}} 2nd percentile{p_end} {synopt:{space 4}{it:...}} 3rd-49th percentiles{p_end} {synopt:{space 4}{opt p50}} 50th percentile (same as {opt median}){p_end} {synopt:{space 4}{it:...}} 51st-97th percentiles{p_end} {synopt:{space 4}{opt p98}} 98th percentile{p_end} {synopt:{space 4}{opt p99}} 99th percentile{p_end} {synopt:{space 4}{opt iqr}} interquartile range = {opt p75} - {opt p25}{p_end} {synopt:{space 4}{opt q}} equivalent to specifying {cmd:p25 p50 p75}{p_end} {synopt:{space 4}{opt sd}} standard deviation{p_end} {synopt:{space 4}{opt v:ariance}} variance{p_end} {synopt:{space 4}{opt cv}} coefficient of variation ({cmd:sd/mean}){p_end} {synopt:{space 4}{opt select#}} #th smallest{p_end} {synopt:{space 4}{opt select-#}} #th largest{p_end} {synopt:{space 4}{opt mi:n}} minimum (same as {opt select1}){p_end} {synopt:{space 4}{opt ma:x}} maximum (same as {opt select-1}){p_end} {synopt:{space 4}{opt r:ange}} range = {opt max} - {opt min}{p_end} {synopt:{space 4}{opt first}} first value{p_end} {synopt:{space 4}{opt last}} last value{p_end} {synopt:{space 4}{opt firstnm}} first nonmissing value{p_end} {synopt:{space 4}{opt lastnm}} last nonmissing value{p_end} {synopt:{space 4}{opt sem:ean}} standard error of mean ({cmd:sd/sqrt(n)}){p_end} {synopt:{space 4}{opt seb:inomial}} standard error of the mean, binomial ({cmd:sqrt(p(1-p)/n)}){p_end} {synopt:{space 4}{opt sep:oisson}} standard error of the mean, Poisson ({cmd:sqrt(mean)}){p_end} {synopt:{space 4}{opt sk:ewness}} skewness{p_end} {synopt:{space 4}{opt k:urtosis}} kurtosis{p_end} {synopt:{space 4}{opt gini}}Gini coefficient (negative truncated to 0){p_end} {synopt:{space 4}{opt gini|dropneg}}Gini coefficient (negative values dropped){p_end} {synopt:{space 4}{opt gini|keepneg}}Gini coefficient (negative values kept; the user is responsible for the interpretation of the Gini in this case){p_end} {space 4}{synoptline} {p2colreset}{...} {marker example}{...} {title:Examples} {pstd} See the {browse "http://gtools.readthedocs.io/en/latest/usage/gstats_summarize/index.html#examples":online documentation} for examples. {marker author}{...} {title:Author} {pstd}Mauricio Caceres{p_end} {pstd}{browse "mailto:mauricio.caceres.bravo@gmail.com":mauricio.caceres.bravo@gmail.com }{p_end} {pstd}{browse "https://mcaceresb.github.io":mcaceresb.github.io}{p_end} {title:Website} {pstd}{cmd:gstats} is maintained as part of the {manhelp gtools R:gtools} project at {browse "https://github.com/mcaceresb/stata-gtools":github.com/mcaceresb/stata-gtools}{p_end} {marker acknowledgment}{...} {title:Acknowledgment} {pstd} {opt gtools} was largely inspired by Sergio Correia's {it:ftools}: {browse "https://github.com/sergiocorreia/ftools"}. {p_end} {pstd} The OSX version of gtools was implemented with invaluable help from @fbelotti; see {browse "https://github.com/mcaceresb/stata-gtools/issues/11"}. {p_end} {title:Also see} {pstd} help for {help summarize}; {help tabstat}; {help gtools}