{smcl} {* *! version 1.2.0 20Mar2019}{...} {vieweralsosee "[P] gtoplevelsof" "mansection P gtoplevelsof"}{...} {viewerjumpto "Syntax" "gtoplevelsof##syntax"}{...} {viewerjumpto "Description" "gtoplevelsof##description"}{...} {viewerjumpto "Options" "gtoplevelsof##options"}{...} {viewerjumpto "Remarks" "gtoplevelsof##remarks"}{...} {viewerjumpto "Stored results" "gtoplevelsof##results"}{...} {title:Title} {p2colset 5 23 23 2}{...} {p2col :{cmd:gtoplevelsof} {hline 2}}Quickly tabulate most common levels of variable list.{p_end} {p2colreset}{...} {marker syntax}{...} {title:Syntax} {p 8 17 2} {opt gtop:levelsof} {varlist} {ifin} [{it:{help gtoplevelsof##weight:weight}}] [{cmd:,} {it:options}] {synoptset 24 tabbed}{...} {synopthdr} {synoptline} {syntab :Summary Options} {synopt:{opth ntop(int)}} Display {opt ntop} most common levels (negative shows least common; {opt .} shows every level).{p_end} {synopt:{opth freqabove(int)}} Only count freqs above this level.{p_end} {synopt:{opth pctabove(real)}} Only count freqs that represent at least % of the total.{p_end} {synopt:{opt mata:save}[{cmd:(}{it:str}{cmd:)}]}Save results in mata object (default name is {bf:GtoolsByLevels}){p_end} {syntab :Toggles} {synopt:{opt missrow}} Add row with count of missing values.{p_end} {synopt:{opt groupmiss:ing}} Count rows with any variable missing as missing.{p_end} {synopt:{opt nomiss:ing}} Case-wise exclude rows with missing values from frequency count.{p_end} {synopt:{opt nooth:er}} Do not group rest of levels into "other" row.{p_end} {synopt:{opt nong:roups}} Do not specify number of groups in "other" row.{p_end} {synopt:{opt alpha}} Sort the top levels of varlist by variables instead of frequencies.{p_end} {synopt:{opt silent}} Do not display the top levels of varlist.{p_end} {syntab :Display Options} {synopt:{opth pctfmt(format)}} Format for percentages.{p_end} {synopt:{opth oth:erlabel(str)}} Specify label for row with "other" count.{p_end} {synopt:{opth missrow:label(str)}} Specify the label for the row with "missing" count.{p_end} {synopt:{opth varabb:rev(int)}} Abbreviate variables (which are displayed as a header to their levels) .{p_end} {synopt:{opth colmax(numlist)}} Specify width limit for levels (can be single number of variable-specific).{p_end} {synopt:{opth colstrmax(numlist)}} Specify width limit for string variables (can be single number of variable-specific).{p_end} {synopt:{opt cols:eparate(separator)}} Column separator; default is double blank " ".{p_end} {synopt:{opth numfmt(format)}} Format for numeric variables. Default is {opt %.8g} (or {opt %16.0g} with {opt matasave}).{p_end} {synopt:{opt novaluelab:els}} Do not replace numeric variables with their value labels.{p_end} {synopt:{opt hidecont:levels}} If a level is repeated in the subsequent row, display a blank.{p_end} {syntab :levelsof Options} {synopt:{opt l:ocal(macname)}}insert top levels in the local macro {it:macname}{p_end} {synopt:{opt s:eparate(separator)}}separator for the values of returned list; default is a space{p_end} {syntab:Gtools} {synopt :{opt compress}}Try to compress strL to str#. {p_end} {synopt :{opt forcestrl}}Skip binary variable check and force gtools to read strL variables. {p_end} {synopt :{opt v:erbose}}Print info during function execution. {p_end} {synopt :{cmd:bench}[{cmd:(}{int}{cmd:)}]}Benchmark various steps of the plugin. Optionally specify depth level. {p_end} {synopt :{opth hash:method(str)}}Hash method (default, biject, or spooky). Intended for debugging. {p_end} {synopt :{opth oncollision(str)}}Collision handling (fallback or error). Intended for debugging. {p_end} {synoptline} {p2colreset}{...} {marker weight}{...} {p 4 6 2} {opt aweight}s, {opt fweight}s, and {opt pweight}s are allowed, in which case the top levels by weight are printed (see {manhelp weight U:11.1.6 weight}) {p_end} {marker description}{...} {title:Description} {pstd} {cmd:gtoplevelsof} (alias {cmd:gtop}) displays a table with the frequency counts, percentages, and cummulative counts and %s of the most common levels of {varlist} that occur in the data. It is similar to the user-written {cmd:group} with the {opt select} otpion or to {opt contract} after keeping only the largest frequency counts. {pstd} Unlike contract, it does not modify the original data and instead prints the resulting table to the console. It also stores a matrix with the frequency counts and stores the levels in the macro {opt r(levels)}. {pstd} {opt gcontract} is part of the {manhelp gtools R:gtools} project. {marker options}{...} {title:Options} {dlgtab:Summary Options} {phang} {opth ntop(int)} Number of levels to display. This can be negative; in that case, the smallest frequencies are displayed. Note cummulative percentages and counts are computed within each generated table, so for the smallest groups the table would display the cummulative count for those frequencies, in descending order. {opt .} displays every level from most to least frequent; {opt -.} displays every level from least to most frequent. {phang} {opth freqabove(int)} Skip frequencies below this level then determining the largest levels. So if this is 10, only frequencies above 10 will be displayed as part of the top frequencies. If every frequency that would be displayed is above this level then this option has no effect. {phang} {opth pctabove(real)} Skip frequencies that are a smaller percentage of the data than {opt pctabove}. If this is 10, then only frequencies that represent at least 10% of all observations are displayed as part of the top frequencies. If every frequency that would be displayed is at least this percentage of the data then this option has no effect. {phang} {opt mata:save}[{cmd:(}{it:str}{cmd:)}]Save results in mata object (default name is {bf:GtoolsByLevels}). See {opt GtoolsByLevels.desc()} for more. This object contains the raw variable levels in {opt numx} and {opt charx} (since mata does not allow matrices of mixed-type). The levels are saved as a string in {opt printed} (with value labels correctly applied) unless option {opt silent} is also specified. Last, the frequencies matrix is saved in {opt toplevels}. {dlgtab:Toggles} {phang}{opt missrow} Add row with count of missing values. By default, missing rows are treated as another group and will be displayed as part of the top levels. With multiple variables, only rows with all values missing are counted here unless {opt groupmissing} is also passed. If this option is specified then a row is printed after the top levels with the frequency count of missing rows. {phang}{opt groupmissing} This option specifies that a missing row is a row where any of the variables have a missing value. See {opt missrow}. {phang}{opt nomissing} Case-wise exclude rows with missing values from frequency count. By default missing values are treated as another level. {phang}{opt noother} By default a row is printed after the top levels with the frequency count from groups not in the top levels and not counted as missing. This option toggles display of that row. {phang}{opt nongroups} By default the number of groups comprising the "Other" and "Missing" rows are printed as part of the "Other" and "Missing" row labels (should they appear; for the missing row this is only printed if more than 1 missing value type is present). This option toggles display of the number of groups represented. {phang}{opt alpha} Sort the top levels of varlist by variables instead of frequencies. Note that the top levels are still extracted; this just affects the final sort order. To sort in inverse order, just pass {opt gtop -var1 -var2 ...}. {phang}{opt silent} Do not display the top levels of varlist. With option {opt matasave} it also does not store the printed levels in a separate string matrix. {dlgtab:Display Options} {phang}{opth pctfmt(format)} Print format for percentage columns. {phang}{opth otherlabel(str)} Specify label for row with the count of the rest of the levels. {phang}{opth missrowlabel(str)} Specify the label for the row the count of the "missing" levels. {phang}{opth varabbrev(int)} Variables names are displayed above their groups. This option specifies that variables should be abbreviated to at most {opt varabbrev} characters. This is ignored if it is smaller than 5. {phang}{opth colmax(numlist)} Specify width limit for levels (can be single number of variable-specific). {phang}{opth colstrmax(numlist)} Specify width limit for string variables (can be single number of variable-specific). Ths overrides {opt colmax} for strings and allows the user to specify string and number widths sepparately. (Also see {opth numfmt(format)}) {phang}{opth numfmt(format)} Format for numeric variables. Default is {opt %.8g} (or {opt %16.0g} with {opt matasave}). By default the number levels are formatted in C, so this must be a valid format for the C internal {opt printf}. The syntax is very similar to mata's {opt printf}. Some examples are: %.2f, %10.6g, %5.0f, and so on. With option {opt matasave} these are formatted in mata, and the format can be any mata number format. {phang}{opt colseparate(separator)} Column separator; default is double blank " ". {phang}{opt novaluelabels} Do not replace numeric variables with their value labels. Value label widths are governed by colmax and NOT colstrmax. {phang}{opt hidecontlevels} If a level is repeated in the subsequent row, display a blank. This is only done if both observations are within the same outer level. {dlgtab:levelsof Options} {phang} {cmd:local(}{it:macname}{cmd:)} inserts the list of levels in local macro {it:macname} within the calling program's space. Hence, that macro will be accessible after {cmd:gtoplevelsof} has finished. This is helpful for subsequent use. Note this uses {opt colseparate} to sepparate columns. The default is " " so be careful when parsing! Rows are enclosed in double quotes (`""') so parsing is possible, just not trivial. {phang} {cmd:separate(}{it:separator}{cmd:)} specifies a separator to serve as punctuation for the values of the returned list. The default is a space. A useful alternative is a comma. {dlgtab:Gtools} {phang} {opt compress} Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. {phang} {opt forcestrl} Skip binary variable check and force gtools to read strL variables (14 and above only). {opt Gtools gives incorrect results when there is binary data in strL variables}. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. {phang} {opt verbose} prints some useful debugging info to the console. {phang} {opt bench:mark} and {opt bench:marklevel(int)} print how long in seconds various parts of the program take to execute. The user can also pass {opth bench(int)} for finer control. {opt bench(1)} is the same as benchmark but {opt bench(2)} and {opt bench(3)} additionally print benchmarks for internal plugin steps. {phang} {opth hashmethod(str)} Hash method to use. {opt default} automagically chooses the algorithm. {opt biject} tries to biject the inputs into the natural numbers. {opt spooky} hashes the data and then uses the hash. {phang} {opth oncollision(str)} How to handle collisions. A collision should never happen but just in case it does {opt gtools} will try to use native commands. The user can specify it throw an error instead by passing {opt oncollision(error)}. {marker remarks}{...} {title:Remarks} {pstd} {cmd:gtoplevelsof} has the main function of displaying the most common levels of {it:varlist}. While {opt tab} is great, it cannot handle a large number of levels, and it prints ALL the levels in alphabetical order. {pstd} Very often when exploring data I just want to have a quick look at the largest levels of a variable that may have thousands of levels in a data set with millions of rows. {opt gcontract} and {opt gcollapse} are great but they modify the original data and doing a lot of subsequent preserve, sort, restore gets very slow very fast. {pstd} I have found this command extremely helpful when exploring big data. Specially if a string is not clean, then having a look at the largest values or the largest values that match a pattern is very helpful. {marker examples}{...} {title:Examples} {pstd} See the {browse "http://gtools.readthedocs.io/en/latest/usage/gtoplevelsof/index.html#examples":online documentation} for more examples. {phang}{cmd:. sysuse auto}{p_end} {phang}{cmd:. gtoplevelsof rep78}{p_end} {phang}{cmd:. gtoplevelsof rep78, missrow local(toplevels)}{p_end} {phang}{cmd:. gtop rep78, colsep(", ")}{p_end} {phang}{cmd:. gtop foreign rep78, ntop(3) missrow}{p_end} {marker results}{...} {title:Stored results} {pstd} {cmd:gtoplevelsof} stores the following in {cmd:r()}: {synoptset 15 tabbed}{...} {p2col 5 15 19 2: Macros}{p_end} {synopt:{cmd:r(levels)}}list of top (most common) levels (rows); not with {opt matasave}{p_end} {synopt:{cmd:r(matalevels)}}name of GtoolsByLevels mata object; only with {opt matasave}{p_end} {p2colreset}{...} {synoptset 20 tabbed}{...} {p2col 5 20 24 2: Scalars}{p_end} {synopt:{cmd:r(N) }} number of non-missing observations {p_end} {synopt:{cmd:r(J) }} number of groups {p_end} {synopt:{cmd:r(minJ) }} largest group size {p_end} {synopt:{cmd:r(maxJ) }} smallest group size {p_end} {synopt:{cmd:r(ntop) }} number of top levels {p_end} {synopt:{cmd:r(nrows)}} number of rows in {opt toplevels} {p_end} {synopt:{cmd:r(alpha)}} sorted by levels intead of frequencies {p_end} {p2colreset}{...} {synoptset 20 tabbed}{...} {p2col 5 20 24 2: Matrices}{p_end} {synopt:{cmd:r(toplevels)}}Table with frequency counts and percentages.{p_end} {p2colreset}{...} {pstd} The missing and other rows are stored in the matrix with IDs 2 and 3, respectively. With {opt matasave}, the following data is stored in {opt GtoolsByLevels}: real scalar anyvars 1: any by variables; 0: no by variables real scalar anychar 1: any string by variables; 0: all numeric by variables real scalar anynum 1: any numeric by variables; 0: all string by variables string rowvector byvars by variable names real scalar kby number of by variables real scalar rowbytes number of bytes in one row of the internal by variable matrix real scalar J number of levels real matrix numx numeric by variables string matrix charx string by variables real scalar knum number of numeric by variables real scalar kchar number of string by variables real rowvector lens > 0: length of string by variables; <= 0: internal code for numeric variables real rowvector map map from index to numx and charx real rowvector charpos position of kth character variable string matrix printed formatted (printf-ed) variable levels (not with option -silent-) real matrix toplevels frequencies of top levels; missing and other rows stored with ID 2 and 3. {marker author}{...} {title:Author} {pstd}Mauricio Caceres Bravo{p_end} {pstd}{browse "mailto:mauricio.caceres.bravo@gmail.com":mauricio.caceres.bravo@gmail.com }{p_end} {pstd}{browse "https://mcaceresb.github.io":mcaceresb.github.io}{p_end} {title:Website} {pstd}{cmd:gtoplevelsof} is maintained as part of {it:gtools} at {browse "https://github.com/mcaceresb/stata-gtools":github.com/mcaceresb/stata-gtools}{p_end} {marker acknowledgment}{...} {title:Acknowledgment} {pstd} This project was largely inspired by Sergio Correia's {it:ftools}: {browse "https://github.com/sergiocorreia/ftools"}. {p_end} {pstd} The OSX version of gtools was implemented with invaluable help from @fbelotti; see {browse "https://github.com/mcaceresb/stata-gtools/issues/11"}. {p_end} {title:Also see} {p 4 13 2} help for {help gcontract}, {help glevelsof}, {help gtools}; {help flevelsof} (if installed), {help ftools} (if installed)