Title

gdsum -- Summarize grouped data

Syntax

    gdsum varlist [if] [in] [, options]

    options            Description
    -------------------------------------------------------------------------
    remove("chars")    remove chars from value labels
    median             additionally calculate median
    min(#)             set lower boundary of first class to #
    max(#)             set upper boundary of last class to #
    comma              treat commas in value labels as decimal point
    matrix(matname)    return output in matrix r(matname)
    novarlabel         use variable names in output matrix
    format(%fmt)       set format for output
    -------------------------------------------------------------------------
    by is allowed

Description

gdsum calculates the mean and standard deviation for grouped data. The median may be calculated optionally. Results are displayed in a matrix and returned in r().

The lower and upper boundaries of the (disjoint) classes are needed to calculate the statistics. This information is provided by defining value labels for each class. The value labels must contain the lower and upper boundary of each class and may contain any other characters. The boundaries must be given in the correct order (lower boundary first), and all other characters that appear before the upper bound must be specified in the remove() option.
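The label rules above can be illustrated with a small Python sketch. It is not gdsum's own code: the helper name parse_bounds and its arguments are hypothetical, and the parsing steps are only an assumption about how the documented behaviour (remove the given strings plus "-", drop or convert commas, read the boundaries in order) might be reproduced.

```python
import re

def parse_bounds(label, remove=(), comma=False):
    """Return the numeric boundaries found at the start of a value label.

    Hypothetical helper, not part of gdsum; it only mimics the rules
    described in the help text.
    """
    text = label
    # "-" is always removed, in addition to the user-specified strings.
    for token in ("-",) + tuple(remove):
        text = text.replace(token, " ")
    # Commas are dropped by default; with comma=True they act as decimal points.
    text = text.replace(",", "." if comma else "")
    bounds = []
    for word in text.split():
        if not re.fullmatch(r"\d+(\.\d+)?", word):
            break  # text after the boundaries (e.g. "Euro") is ignored
        bounds.append(float(word))
        if len(bounds) == 2:
            break
    return bounds

print(parse_bounds("$ 0 to 499.99", remove=("$", "to")))  # [0.0, 499.99]
print(parse_bounds("0-499,99 Euro"))                      # [0.0, 49999.0]
print(parse_bounds("0-499,99 Euro", comma=True))          # [0.0, 499.99]
```

A label with only one number, such as "more than USD 2500", yields a single boundary in this sketch; the min() and max() options describe how gdsum fills in the missing one.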

Options

remove("chars") specifies the characters to be removed from value labels. The default is "-", which is always added to the characters specified. Commas in value labels are ignored unless comma is specified. Double quotes may be omitted.

median additionally calculates and returns the median.

min(#) sets the lower boundary of the first class to #. This option may be used if the lower bound of the first class is not given in the value label. If min() is not specified in this case, min(0) is the default.

max(#) sets the upper boundary of the last class to #. This option may be used if the upper bound of the last class is not given in the value label. If max() is not specified in this case, the upper bound is set to the lower bound of the last class.

comma treats commas in value labels as decimal points. The default is to ignore commas. This option temporarily sets dpcomma.

matrix(matname) returns the result matrix in matname.

novarlabel uses variable names in the output. The default is to use variable labels. Labels are abbreviated to 32 characters.

format(%fmt) sets the display format.

Examples

Example 1

    . tabulate inc

     Income per month |      Freq.     Percent        Cum.
    ------------------+-----------------------------------
        $ 0 to 499.99 |          4       26.67       26.67
      $ 500 to 999.99 |          6       40.00       66.67
    $ 1000 to 2499.99 |          3       20.00       86.67
       $ 2500 to 5000 |          2       13.33      100.00
    ------------------+-----------------------------------
                Total |         15      100.00

    . gdsum inc ,remove($ to)

                     |       Mean         SD        Obs
    -----------------+---------------------------------
    Income per month |   1216.662   1156.762         15

Example 2

    . tabulate inc

            Einkommen |      Freq.     Percent        Cum.
    ------------------+-----------------------------------
        0-499,99 Euro |          4       26.67       26.67
      500-999,99 Euro |          6       40.00       66.67
    1000-2499,99 Euro |          3       20.00       86.67
       2500-5000 Euro |          2       13.33      100.00
    ------------------+-----------------------------------
                Total |         15      100.00

    . gdsum inc ,novarlabel

                 |       Mean         SD        Obs
    -------------+---------------------------------
             inc |   52366.23   41226.84         15

Note that "-" and "Euro" do not have to be removed, because "-" is removed by default and "Euro" follows the upper boundary. The result differs from the result in Example 1 because a comma is used as the decimal point. The default is to remove commas, so "499,99" becomes "49999". To get the correct result, specify comma.

    . gdsum inc ,novarlabel comma

                 |       Mean         SD        Obs
    -------------+---------------------------------
             inc |   1216,662   1156,762         15

Example 3

    . tabulate inc

        Income p. month |      Freq.     Percent        Cum.
    --------------------+-----------------------------------
       up to USD 499.99 |          4       26.67       26.67
      USD 500 to 999.99 |          6       40.00       66.67
    USD 1000 to 2499.99 |          3       20.00       86.67
     more than USD 2500 |          2       13.33      100.00
    --------------------+-----------------------------------
                  Total |         15      100.00

    . gdsum inc ,remove(up to USD "more than") median
    inc: lower boundary set to 0
    inc: upper boundary set to 2500

                   |       Mean         SD        p50        Obs
    ---------------+--------------------------------------------
    Income p month |   1049.996   791.6993   791.6608         15

Note that the lower boundary is missing in the first class and the upper boundary is missing in the last class. Since min() and max() are not specified, gdsum uses "0" as the lower bound of the first class and "2500" as the upper bound of the last class. To reproduce the mean and standard deviation of Example 1, specify max(5000).

    . gdsum inc ,remove(up to USD "more than") median max(5000)
    inc: lower boundary set to 0
    inc: upper boundary set to 5000

                   |       Mean         SD        p50        Obs
    ---------------+--------------------------------------------
    Income p month |   1216.662   1156.762   791.6608         15
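The effect of the substituted boundaries on the mean can be checked by hand. The following Python sketch is illustrative only (the function grouped_mean and the variable names are hypothetical, not part of gdsum); it assumes the class boundaries and frequencies shown in the tabulate output of Example 3.

```python
def grouped_mean(bounds, freqs):
    """Frequency-weighted average of the class midpoints."""
    n = sum(freqs)
    return sum(f * (lb + ub) / 2 for (lb, ub), f in zip(bounds, freqs)) / n

freqs = [4, 6, 3, 2]
inner = [(500, 999.99), (1000, 2499.99)]

# Without min() and max(): first class starts at 0, and the upper bound
# of the last class equals its lower bound (2500).
default_bounds = [(0, 499.99)] + inner + [(2500, 2500)]
# With max(5000): the last class spans 2500 to 5000, as in Example 1.
max5000_bounds = [(0, 499.99)] + inner + [(2500, 5000)]

mean_default = grouped_mean(default_bounds, freqs)
mean_max5000 = grouped_mean(max5000_bounds, freqs)
print(round(mean_default, 3), round(mean_max5000, 3))  # 1049.996 1216.662
```

The only difference between the two runs is the midpoint of the last class (2500 versus 3750), which is why max() matters for open-ended top classes. The median is unchanged because the median class is the second one.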

Saved results

gdsum saves the following in r():

Scalars

    r(mean_varname)    mean
    r(sd_varname)      standard deviation
    r(N_varname)       non-missing observations
    r(p50_varname)     median (median only)

Matrices

    r(matname)         result matrix (matrix() only)

Formulas

The mean is calculated as

    (1)  M = (A' * F) / n

with

    A  = (el1 el2 ... elk)'
    el = (lb + ub)/2
    F  = (freq1 freq2 ... freqk)'

A and F are k x 1 vectors, where k is the number of classes. In el, lb and ub are the lower and upper boundaries of the class intervals. In F, freq is the class frequency. The number of non-missing observations is n.

The standard deviation is calculated as

    (2)  sd = sqrt((1/(n-1)) * (B' * F))

with

    B  = (el1 el2 ... elk)'
    el = (xm - M)^2

In el, xm is the mid-point of the class interval (i.e. the corresponding element of A) and M is the mean as calculated in (1).

The median is calculated as

    (3)  p50 = lbm + ((n/2 - cf) / nmcl) * (ubm - lbm)

where lbm is the lower boundary of the median class, cf is the cumulative frequency up to the class preceding the median class, nmcl is the number of observations in the median class and ubm is the upper boundary of the median class.
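Formulas (1) to (3) can be verified against the examples with a short Python sketch. This is not gdsum's implementation; the function name grouped_summary is hypothetical, and the rule used to pick the median class (the first class whose cumulative frequency reaches n/2) is an assumption, since the help text does not define it.

```python
from math import sqrt

def grouped_summary(bounds, freqs):
    """Return (mean, sd, p50) for grouped data, following (1) to (3)."""
    n = sum(freqs)
    mids = [(lb + ub) / 2 for lb, ub in bounds]            # elements of A
    mean = sum(m * f for m, f in zip(mids, freqs)) / n     # formula (1)
    sd = sqrt(sum((m - mean) ** 2 * f                      # formula (2)
                  for m, f in zip(mids, freqs)) / (n - 1))
    cf = 0  # cumulative frequency before the current class
    for (lb, ub), f in zip(bounds, freqs):
        if cf + f >= n / 2:  # assumed definition of the median class
            p50 = lb + (n / 2 - cf) / f * (ub - lb)        # formula (3)
            break
        cf += f
    return mean, sd, p50

# Classes and frequencies from Example 1.
bounds = [(0, 499.99), (500, 999.99), (1000, 2499.99), (2500, 5000)]
freqs = [4, 6, 3, 2]
mean, sd, p50 = grouped_summary(bounds, freqs)
print(round(mean, 3), round(sd, 3), round(p50, 4))  # 1216.662 1156.762 791.6608
```

With n = 15 the median class is the second one (cf = 4, nmcl = 6), so p50 = 500 + (7.5 - 4)/6 * 499.99, which matches the p50 shown in Example 3.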

Author

Daniel Klein, University of Bamberg, daniel1.klein@gmx.de

Also see

Online: summarize