help gdsum -------------------------------------------------------------------------------
Title
gdsum -- Summarize grouped data
Syntax
gdsum varlist [if] [in] [, options]
options             Description
-------------------------------------------------------------------------
remove("chars")     remove chars from value labels
median              additionally calculate median
min(#)              set lower boundary of first class to #
max(#)              set upper boundary of last class to #
comma               treat commas in value labels as decimal point
matrix(matname)     return output in matrix r(matname)
novarlabel          use variable names in output matrix
format(%fmt)        set format for output
-------------------------------------------------------------------------
by is allowed
Description
gdsum calculates the mean and standard deviation for grouped data. The median may be calculated optionally. Results are displayed in a matrix and returned in r(). The lower and upper boundaries of the (disjoint) classes are needed to calculate the statistics. The way to provide this information is to define value labels for each class.
The value labels must contain the lower and upper boundary of each class and may contain any other characters. The boundaries must be given in the correct order and all other characters (before the upper bound) must be specified in the remove() option.
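As a minimal sketch of this setup, assume a variable inc that holds the class codes 1 to 4 and a value label named inclbl (both names are hypothetical and not prescribed by gdsum). Because "-" is removed by default, no remove() option should be needed for labels of this form.

```stata
* Hypothetical setup: inc holds the class codes 1-4
label define inclbl 1 "0-499.99" 2 "500-999.99" 3 "1000-2499.99" 4 "2500-5000"
label values inc inclbl

* The class boundaries are read from the value labels
gdsum inc
```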
Options
remove("chars") specifies the characters to be removed from value labels. The default is "-", which is always removed in addition to the characters specified. Commas in value labels are ignored unless comma is specified. Double quotes may be omitted.
median additionally calculates and returns the median.
min(#) sets the lower boundary of the first class to #. This option may be used if the lower bound of the first class is not given in the value label. In this case min(0) is the default.
max(#) sets the upper boundary of the last class to #. This option may be used if the upper bound of the last class is not given in the value label. In this case max() is set to the lower bound of the last class.
comma treats commas in value labels as the decimal point. The default is to ignore commas. This option temporarily sets the decimal point to a comma (set dp comma).
matrix(matname) returns the result matrix in matname.
novarlabel uses variable names in the output. Default is to use variable labels. Labels are abbreviated to 32 characters.
format(%fmt) sets the display format.
Examples
Example 1
. tabulate inc
 Income per month |      Freq.     Percent        Cum.
------------------+-----------------------------------
    $ 0 to 499.99 |          4       26.67       26.67
  $ 500 to 999.99 |          6       40.00       66.67
$ 1000 to 2499.99 |          3       20.00       86.67
   $ 2500 to 5000 |          2       13.33      100.00
------------------+-----------------------------------
            Total |         15      100.00
. gdsum inc ,remove($ to)
                 |       Mean         SD        Obs
-----------------+---------------------------------
Income per month |   1216.662   1156.762         15
Example 2
. tabulate inc
        Einkommen |      Freq.     Percent        Cum.
------------------+-----------------------------------
    0-499,99 Euro |          4       26.67       26.67
  500-999,99 Euro |          6       40.00       66.67
1000-2499,99 Euro |          3       20.00       86.67
   2500-5000 Euro |          2       13.33      100.00
------------------+-----------------------------------
            Total |         15      100.00
. gdsum inc ,novarlabel
             |       Mean         SD        Obs
-------------+---------------------------------
         inc |   52366.23   41226.84         15
Note that "-" and "Euro" do not have to be removed, because "-" is the default setting and "Euro" follows the upper boundary. The result is not equal to the result above, because a comma is used as decimal point. The default setting is to remove commas, so "499,99" becomes "49999". In order to get the correct result you have to specify comma.
. gdsum inc ,novarlabel comma
             |       Mean         SD        Obs
-------------+---------------------------------
         inc |   1216,662   1156,762         15
Example 3
. tabulate inc
    Income p. month |      Freq.     Percent        Cum.
--------------------+-----------------------------------
   up to USD 499.99 |          4       26.67       26.67
  USD 500 to 999.99 |          6       40.00       66.67
USD 1000 to 2499.99 |          3       20.00       86.67
 more than USD 2500 |          2       13.33      100.00
--------------------+-----------------------------------
              Total |         15      100.00
. gdsum inc ,remove(up to USD "more than") median
inc: lower boundary set to 0
inc: upper boundary set to 2500

               |       Mean         SD        p50        Obs
---------------+--------------------------------------------
Income p month |   1049.996   791.6993   791.6608         15
Note that the lower boundary is missing in the first class and the upper boundary is missing in the last class. Since min() and max() are not specified, gdsum uses "0" as the lower bound of the first and "2500" as upper bound of the last class. To get the above result, specify max(5000).
. gdsum inc ,remove(up to USD "more than") median max(5000)
inc: lower boundary set to 0
inc: upper boundary set to 5000

               |       Mean         SD        p50        Obs
---------------+--------------------------------------------
Income p month |   1216.662   1156.762   791.6608         15
Saved results
gdsum saves the following in r():
Scalars
    r(mean_varname)   mean
    r(sd_varname)     standard deviation
    r(N_varname)      non-missing observations
    r(p50_varname)    median (median only)

Matrices
    r(matname)        result matrix (matrix() only)
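The following sketch shows how the saved results might be accessed after the call from Example 1. It assumes the variable is named inc, so the scalars are r(mean_inc) and r(p50_inc). The matrix name res is a hypothetical choice.

```stata
* Run gdsum and request the result matrix under the (hypothetical) name res
gdsum inc, remove($ to) median matrix(res)

* List everything gdsum left in r()
return list

* Use individual results
display "mean = " r(mean_inc) "  median = " r(p50_inc)
matrix list r(res)
```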
Formulas
The mean is calculated as
(1) M = (A' * F) / n
with
    A  = (el1 el2 ... elk)'
    el = ((lb + ub)/2)
    F  = (freq1 freq2 ... freqk)'
A and F are k x 1 vectors, where k is the number of classes. In el, lb and ub are the lower and upper boundaries of the class intervals. In F, freq is the class frequency. The number of non-missing observations is n.
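As a worked check of formula (1), the sketch below enters the class mid-points and frequencies from Example 1 by hand. It uses only Stata's built-in matrix commands and is not gdsum's own code. The result should be approximately 1216.662, the mean reported in Example 1.

```stata
* Class mid-points (A) and frequencies (F) from Example 1
matrix A = (249.995 \ 749.995 \ 1749.995 \ 3750)
matrix F = (4 \ 6 \ 3 \ 2)

* Formula (1): M = (A' * F) / n, with n = 15
matrix M = (A' * F) / 15
display M[1,1]
```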
The standard deviation is calculated as
(2) sd = sqrt((1/(n-1)) * (B' * F))
with
    B  = (el1 el2 ... elk)'
    el = ((xm - M)^2)
In el, xm is the mid-point of the class interval (i.e. the corresponding element of A) and M is the mean as calculated in (1).
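The sketch below works through formula (2) with the Example 1 data, again using only built-in matrix commands rather than gdsum's own code. The matrix and macro names are illustrative. Dividing the weighted sum of squared deviations by n - 1 should give approximately 1156.762, the SD reported in Example 1.

```stata
* Class mid-points (A) and frequencies (F) from Example 1
matrix A = (249.995 \ 749.995 \ 1749.995 \ 3750)
matrix F = (4 \ 6 \ 3 \ 2)
local n = 15

* Mean from formula (1)
matrix Mm = (A' * F) / `n'
local M = Mm[1,1]

* B holds the squared deviations of the mid-points from the mean
matrix B = J(4, 1, 0)
forvalues i = 1/4 {
    matrix B[`i', 1] = (A[`i', 1] - `M')^2
}

* Formula (2), dividing by n - 1
matrix S = (B' * F) / (`n' - 1)
display sqrt(S[1,1])
```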
The median is calculated as
(3) p50 = lbm + ((n/2 - cf) / nmcl) * (ubm-lbm)
where lbm is the lower boundary of the median class, cf is the cumulative frequency of the class prior to the median class, nmcl is the number of observations in the median class and ubm is the upper boundary of the median class.
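Formula (3) can be checked by hand against the examples above. With n = 15, the value n/2 = 7.5 falls in the second class (cumulative frequencies 4 and 10), so lbm = 500, ubm = 999.99, cf = 4, and nmcl = 6. The line below should give approximately 791.6608, the median reported in Example 3.

```stata
* Formula (3): p50 = lbm + ((n/2 - cf) / nmcl) * (ubm - lbm)
display 500 + ((15/2 - 4) / 6) * (999.99 - 500)
```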
Author
Daniel Klein, University of Bamberg, daniel1.klein@gmx.de
Also see
Online: summarize