2.49.0 06may2022{smcl}
{* *! version 2.49.0 06may2022}{...}
{vieweralsosee "ftools" "help ftools"}{...}
{vieweralsosee "[R] collapse" "help collapse"}{...}
{vieweralsosee "[R] contract" "help contract"}{...}
{viewerjumpto "Syntax" "fcollapse##syntax"}{...}
{viewerjumpto "Description" "fcollapse##description"}{...}
{viewerjumpto "Options" "fcollapse##options"}{...}
{title:Title}

{p2colset 5 18 23 2}{...}
{p2col :{cmd:fcollapse} {hline 2}}Efficiently
make dataset of summary statistics{p_end}
{p2colreset}{...}

{marker syntax}{...}
{title:Syntax}

{p 8 17 2}
{cmd:fcollapse}
{it:clist}
{ifin}
[{cmd:,} {it:{help fcollapse##table_options:options}}]

{pstd}where {it:clist} is either

{p 8 17 2}
[{opt (stat)}]
{varlist}
[ [{opt (stat)}] {it:...} ]{p_end}

{p 8 17 2}
[{opt (stat)}] {it:target_var}{cmd:=}{varname}
        [{it:target_var}{cmd:=}{varname} {it:...}]
        [ [{opt (stat)}] {it:...}]

{p 4 4 2}or any combination of the {it:varlist} or {it:target_var} forms, and
{it:stat} is one of{p_end}

{p2colset 9 22 24 2}{...}
{p2col :{opt mean}}means (default){p_end}
{p2col :{opt median}}medians{p_end}
{p2col :{opt p1}}1st percentile{p_end}
{p2col :{opt p2}}2nd percentile{p_end}
{p2col :{it:...}}3rd{hline 1}49th percentiles{p_end}
{p2col :{opt p50}}50th percentile (same as {cmd:median}){p_end}
{p2col :{it:...}}51st{hline 1}97th percentiles{p_end}
{p2col :{opt p98}}98th percentile{p_end}
{p2col :{opt p99}}99th percentile{p_end}
{p2col :{opt sum}}sums{p_end}
{p2col :{opt count}}number of nonmissing observations{p_end}
{p2col :{opt percent}}percentage of nonmissing observations{p_end}
{p2col :{opt max}}maximums{p_end}
{p2col :{opt min}}minimums{p_end}
{p2col :{opt iqr}}interquartile range{p_end}
{p2col :{opt first}}first value{p_end}
{p2col :{opt last}}last value{p_end}
{p2col :{opt firstnm}}first nonmissing value{p_end}
{p2col :{opt lastnm}}last nonmissing value{p_end}
{p2col :{opt nansum}}same as sum, but if all obs. in the group are missing it will also be missing (instead of zero){p_end}
{p2col :{opt raw}{inp:{bf:{it:stat}}}}compute stats while ignoring weights (a generalization of {it:rawsum}){p_end}
{p2colreset}{...}

{pstd}
If {it:stat} is not specified, {opt mean} is assumed.

{pstd}
Technical limitation: Both normal stats and {it:raw} stats will ignore zero weights

{synoptset 15 tabbed}{...}
{marker table_options}{...}
{synopthdr}
{synoptline}
{syntab :Options}
{synopt :{opth by(varlist)}}groups over which {it:stat} is to be calculated
{p_end}
{synopt :{opt merge}}merge collapsed dataset back into the original one;
if the dataset is unsorted or sorted by something different than {opt by()},
it is much more efficient than {cmd:egen} and that combining {cmd:collapse} with {cmd:merge}
{p_end}
{synopt :{opt append}}append collapsed dataset at the end of the original one;
this is useful to create rows of totals
{p_end}
{synopt :{opt cw}}casewise deletion instead of all possible observations
{p_end}
{synopt :{opt fast}}do not preserve and restore the original dataset;
saves speed but leaves the data in an unusable state shall the
user press {hi:Break}
{p_end}
{synopt :{opt smart}}invoke {cmd:collapse} if the data is already sorted (in which case {cmd:collapse} might be faster)
{p_end}
{synopt :{cmd:freq}[{cmd:(}{newvar}{cmd:)}]}store
the raw observation count (similar to {help contract}).
If not indicated, the name of the new variable will be {it:_freq}
{p_end}
{synopt :{opt reg:ister(keys)}}add new stat functions.
For each key, a corresponding Mata function should exist.
See example at the end
{p_end}
{synopt :{opt pool(#)}}load the data into stata in blocks of # variables
Default is {it:pool(.)}, select a low value ({it:pool(5)})
or very low value ({it:pool(1)}) to save memory at the cost of speed
{p_end}
{synopt :{opt nocompress}}{it:compress} chooses the most compact variable type, at a small speed cost
(on by default)
{p_end}
{synopt :{opt v:erbose}}display misc. debug messages
{p_end}

{synoptline}
{p2colreset}{...}
{p 4 6 2}


{marker description}{...}
{title:Description}

{pstd}
{opt fcollapse} converts the dataset in memory into a dataset of means, sums,
medians, etc.  {it:clist} can refer to numeric and string variables
although string variables are only supported by a few functions
(first, last, firstnm, lastnm).

{pstd}
Weights are only partially supported.

{pstd}
You can implement your own Mata functions to easily extend the fcollapse command.


{marker options}{...}
{title:Options}

{dlgtab:Options}

{phang}
{opth by(varlist)} specifies the groups over which the means, etc., are to be
calculated.  If this option is not specified, the resulting dataset will
contain 1 observation.  If it is specified, {it:varlist} may refer to either
string or numeric variables.

{phang}
{opt merge} works similarly to {cmd:egen}.
It will collapse the data in Mata and then add it back to the original dataset.
If the dataset is not sorted by the groups set in {opt by()}, this is much faster than {cmd:egen} and {cmd:collapse} followed by {cmd:merge}.

{phang}
{opt cw} specifies casewise deletion.  If {opt cw} is not specified, all
possible observations are used for each calculated statistic.

{phang}
{opt fast} specifies that {opt fcollapse} not restore the original dataset
should the user press {hi:Break}.

{phang}
{opt freq} stores frequencies on a new variable {it:_freq}.
To choose the name of the variable, use {opth freq(newvar)}

{phang}
{opt reg:ister(fun1 ...)} registers Mata functions {it:fun1}, etc. so 
to extend {cmd fcollapse}; see example below.

{phang}
{opt pool(#)} load the data into Stata in blocks of # variables Default is pool(.),
select a low value (pool(5)) or very low value (pool(1)) to save memory at the cost of speed.

{phang}
{opt compress} will fit variables into more compact types, such as {it:byte},
{it:int}, and {it:long}, without losing information when compared to more accurate types
such as {it:double}.
The cost is a slight reduction in speed, due to the extra checks involved.

{marker example}{...}
{title:Example: Adding your own aggregation functions}

The following code adds the stat. {it:variance}:


{inp}    sysuse auto, clear

    cap mata: mata drop aggregate_variance()
    
    mata:
    mata set matastrict on
    transmorphic colvector aggregate_variance(
        class Factor F,
        transmorphic colvector data,
        real colvector weights)
    {
        real scalar i
        transmorphic colvector results
        results = J(F.num_levels, 1, missingof(data))
        for (i = 1; i <= F.num_levels; i++) {
            results[i] = quadvariance(panelsubmatrix(data, i, F.info))
        }
        return(results)
    }
    end
    
    fcollapse (mean) price (variance) weight foreign, by(turn) register(variance) freq
    li
{text}

Note that the to create a new stat {it:variance} we created a Mata function
called {it:aggregate_variance}. To avoid overlap with other Mata functions,
your function must start with {it:aggregate_}.


{marker author}{...}
{title:Author}

{pstd}Sergio Correia{break}
Board of Governors of the Federal Reserve System, USA{break}
{browse "mailto:sergio.correia@gmail.com":sergio.correia@gmail.com}{break}
{p_end}


{marker project}{...}
{title:More Information}

{pstd}{break}
To report bugs, contribute, ask for help, etc. please see the project URL in Github:{break}
{browse "https://github.com/sergiocorreia/ftools"}{break}
{p_end}


{marker acknowledgment}{...}
{title:Acknowledgment}

{pstd}
This help file was based on StataCorp's own help file
for {it:collapse}.
{p_end}

{pstd}
This project was largely inspired by the works of
{browse "http://wesmckinney.com/blog/nycpython-1102012-a-look-inside-pandas-design-and-development/":Wes McKinney}, 
{browse "http://www.stata.com/meeting/uk15/abstracts/":Andrew Maurer}
and
{browse "https://ideas.repec.org/c/boc/bocode/s455001.html":Benn Jann}.
{p_end}