{smcl}
{* version 1.0.2 08dec2010}
{cmd:help gdsum}
{hline}
{title:Title}
{p 5}
{cmd:gdsum} {hline 2} Summarize grouped data
{title:Syntax}
{p 8}
{cmd:gdsum} {varlist} {ifin} [{cmd:,} {it:options}]
{synoptset 21 tabbed}{...}
{synopthdr}
{synoptline}
{synopt:{opt rem:ove("chars")}}remove {it:chars} from value labels{p_end}
{synopt:{opt med:ian}}additionally calculate median{p_end}
{synopt:{opt min(#)}}set lower boundary of first class to {it:#}{p_end}
{synopt:{opt max(#)}}set upper boundary of last class to {it:#}{p_end}
{synopt:{opt com:ma}}treat commas in value labels as decimal point{p_end}
{synopt:{opt mat:rix(matname)}}return output in matrix {cmd:r(}{it:matname}{cmd:)}{p_end}
{synopt:{opt novarl:abel}}use variable names in output matrix{p_end}
{synopt:{opt f:ormat(%fmt)}}set {help format} for output{p_end}
{synoptline}
{p 4}
{helpb by} is allowed
{title:Description}
{pstd}
{cmd:gdsum} calculates the mean and standard deviation for grouped data. The median may be
calculated optionally. Results are displayed in a matrix and returned in {cmd:r()}. The
lower and upper boundaries of the (disjoint) classes are needed to calculate the
statistics. The way to provide this information is to define value labels for each class.
{pstd}
The value labels must contain the lower {hi:and} upper boundary of each class and may
contain any other characters. The boundaries must be given in the correct order and all
other characters (before the upper bound) must be specified in the {opt remove()} option.
{title:Options}
{dlgtab:options}
{phang}
{opt remove("chars")} specifies the characters to be removed from value labels. The
default is {hi:"-"} and it is added to the characters specified. Commas in value labels
are ignored, unless {opt comma} is specified. Double quotes may be omitted.
{phang}
{opt median} additionally calculates and returns the median.
{phang}
{opt min(#)} sets the lower boundary of the first class to {it:#}. This option may be used
if the lower bound of the first class is not given in the value label. In this case
{opt min(0)} is the default.
{phang}
{opt max(#)} sets the upper boundary of the last class to {it:#}. This option may be used
if the upper bound of the last class is not given in the value label. In this case
{opt max()} is set to the lower bound of the last class.
{phang}
{opt comma} treats commas in value labels as decimal point. The default is to ignore
commas. This option temporarily {help set dp} {cmd: comma}.
{phang}
{opt matrix(matname)} returns the result matrix in {it:matname}.
{phang}
{opt novarlabel} uses variable names in the output. Default is to use variable labels.
Labels are abbreviated to 32 characters.
{phang}
{opt format(%fmt)} sets the display format.
{title:Examples}
{pstd}
Example 1
. tabulate inc
Income per month | Freq. Percent Cum.
------------------+-----------------------------------
$ 0 to 499.99 | 4 26.67 26.67
$ 500 to 999.99 | 6 40.00 66.67
$ 1000 to 2499.99 | 3 20.00 86.67
$ 2500 to 5000 | 2 13.33 100.00
------------------+-----------------------------------
Total | 15 100.00
{cmd:. gdsum inc ,remove($ to)}
| Mean SD Obs
-----------------+---------------------------------
Income per month | 1216.662 1156.762 15
{pstd}
Example 2
. tabulate inc
Einkommen | Freq. Percent Cum.
------------------+-----------------------------------
0-499,99 Euro | 4 26.67 26.67
500-999,99 Euro | 6 40.00 66.67
1000-2499,99 Euro | 3 20.00 86.67
2500-5000 Euro | 2 13.33 100.00
------------------+-----------------------------------
Total | 15 100.00
{cmd:. gdsum inc ,novarlabel}
| Mean SD Obs
-------------+---------------------------------
inc | 52366.23 41226.84 15
{pstd}
Note that "-" and "Euro" do not have to be removed, because "-" is the default setting and
"Euro" follows the upper boundary. The result is not equal to the result above, because a
comma is used as decimal point. The default setting is to remove commas, so "499,99"
becomes "49999". In order to get the correct result you have to specify {opt comma}.
{cmd:. gdsum inc ,novarlabel comma}
| Mean SD Obs
-------------+---------------------------------
inc | 1216,662 1156,762 15
{pstd}
Example 3
. tabulate inc
Income p. month | Freq. Percent Cum.
--------------------+-----------------------------------
up to USD 499.99 | 4 26.67 26.67
USD 500 to 999.99 | 6 40.00 66.67
USD 1000 to 2499.99 | 3 20.00 86.67
more than USD 2500 | 2 13.33 100.00
--------------------+-----------------------------------
Total | 15 100.00
{cmd:. gdsum inc ,remove(up to USD "more than") median}
inc: lower boundary set to 0
inc: upper boundary set to 2500
| Mean SD p50 Obs
---------------+--------------------------------------------
Income p month | 1049.996 791.6993 791.6608 15
{pstd}
Note that the lower boundary is missing in the first class and the upper boundary is
missing in the last class. Since {opt min()} and {opt max()} are not specified,
{cmd:gdsum} uses "0" as the lower bound of the first and "2500" as upper bound of the last
class. To get the above result, specify {opt max(5000)}.
{cmd:. gdsum inc ,remove(up to USD "more than") median max(5000)}
inc: lower boundary set to 0
inc: upper boundary set to 5000
| Mean SD p50 Obs
---------------+--------------------------------------------
Income p month | 1216.662 1156.762 791.6608 15
{title:Saved results}
{pstd}
{cmd:gdsum} saves the following in {cmd:r()}:
{pstd}
Scalars{p_end}
{synopt:{cmd:r(mean_varname)}}mean{p_end}
{synopt:{cmd:r(sd_varname)}}standard deviation{p_end}
{synopt:{cmd:r(N_varname)}}non-missing observations{p_end}
{synopt:{cmd:r(p50_varname)}}median ({opt median} only){p_end}
{pstd}
Matrices{p_end}
{synopt:{cmd:r(matname)}}result matrix ({opt matrix()} only){p_end}
{title:Formulas}
{pstd}
The mean is calculated as
{p 8}
(1) {it:M = (A' * F) / n}{p_end}
{pstd}
with{p_end}
{p 8}
{it:A = (el1 el2 ... elk)'}{p_end}
{p 7}
{it:el = ((lb + ub)/2)}{p_end}
{p 8}
{it:F} = (freq1 freq2 ... freqk)'
{pstd}
{it:A} and {it:F} are {it:k} x 1 vectors, where {it:k} is the number of classes. In
{it:el}, {it:lb} and {it:ub} are the lower and upper boundaries of the class intervals. In
{it:F}, {it:freq} is the class frequency. The number of non-missing observations is
{it:n}.
{pstd}
The standard deviation is calculated as
{p 8}
(2) {it:sd = sqrt((1/n-1) * (B' * F))}{p_end}
{pstd}
with{p_end}
{p 8}
{it:B = (el1 el2 ... elk)'}{p_end}
{p 7}
{it:el = ((xm - M)^2)}{p_end}
{pstd}
In {it:el}, {it:xm} is the mid-point of the class interval (i.e. the elements of A'), {it:M}
is the mean - as calculated in (1).
{pstd}
The median is calculated as
{p 8}
(3) {it:p50 = lbm + ((n/2 - cf) / nmcl) * (ubm-lbm)}{p_end}
{pstd}
where {it:lbm} is the lower boundary of the median class, {it:cf} is the cumulative
frequency of the class prior to the median class, {it:nmcl} is the number of observations
in the median class and {it:ubm} is the upper boundary of the median class.
{title:Author}
{pstd}Daniel Klein, University of Bamberg, daniel1.klein@gmx.de
{title:Also see}
{psee}
Online: {help summarize}
{p_end}