help gsum
-------------------------------------------------------------------------------
Title
gsum -- Summary statistics for grouped data
Syntax
gsum varlist [if] [in] [weight] [, options group_definitions]
-----------------------------------------------------------------------------
Specifying Group Ranges
gsum accepts variables with codes from 0 to 25 (integers only).
Group definitions take the form g0(#-#), g1(#-#), ..., g#(#-#), where g# identifies the group number and #-# gives the numeric range that the group represents.
If, however, each category of the variable in varlist has a value label in the format #-#, gsum can read the ranges directly from those labels.
If you do not specify group definitions, gsum will look for labels.
If you do specify group definitions, gsum will ignore the labels.
options               Description
-------------------------------------------------------------------------
quantiles(q q ...)    the set of quantiles to be calculated; the default
                      set is 0.25, 0.50, and 0.75
gen(newvarlist)       create a new variable called newvarlist containing
                      the midpoints
table                 display the value table
save(filename)        save the value table to filename
-------------------------------------------------------------------------
Description
gsum calculates summary statistics for an ordinal variable in which each category represents a range of a conceptually continuous variable. gsum reports the weighted N, the mean, the standard deviation, and, by default, the 0.25, 0.50 (median), and 0.75 quantiles; you can request any other set with quantiles(). Each quantile is reported in two forms: as the midpoint of the category in which the quantile falls, and as a linear interpolation within that category based on the methods presented by Blalock (1979).
gsum can also produce a value table (which can also be saved) listing each category, the range, the midpoint of that range, the number of cases, the weight of each case, and the cumulative distribution function (CDF).
As an additional tool, gsum can create a new variable that contains the midpoint of each observation's category.
gsum accepts any type of [weight] and is byable.
For example, you may have a variable age_cat where 1 represents 18-24 years of age, 2 represents 25-44 years of age, and 3 represents 45-100 years of age. You can use gsum to calculate summary statistics such as the mean, median, and standard deviation.
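As a rough illustration of the midpoint idea described above, the following sketch assigns each age_cat category the midpoint of its stated range by hand and summarizes the result with built-in commands. The variable name age_mid is hypothetical, and whether gsum's midpoint-based statistics correspond exactly to this calculation is an assumption that this help file does not confirm.

```stata
* Sketch only: each midpoint is (lower + upper) / 2 of the stated range.
* age_mid is a hypothetical variable name, not something gsum creates.
use gssage.dta, clear
generate double age_mid = .
* category 1 is 18-24, so the midpoint is 21
replace age_mid = (18 + 24) / 2 if age_cat == 1
* category 2 is 25-44, so the midpoint is 34.5
replace age_mid = (25 + 44) / 2 if age_cat == 2
* category 3 is 45-100, so the midpoint is 72.5
replace age_mid = (45 + 100) / 2 if age_cat == 3
* unweighted mean and standard deviation of the midpoints
summarize age_mid
```

To obtain a midpoint variable from gsum itself, use the gen() option rather than building the variable by hand.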
Examples
Use the 2010 GSS data on age
. use gssage.dta, clear
If the variable age_cat is labeled correctly,
. gsum age_cat
Or, if you are not sure that it is, specify the group definitions yourself,
. gsum age_cat, g1(18-24) g2(25-44) g3(45-100)
To use weights,
. gsum age_cat [pweight = wtssall]
To see the value table,
. gsum age_cat, table
To save the value table in the file valuetable.dta,
. gsum age_cat, save(valuetable.dta)
To create the variable midpoint_age_cat,
. gsum age_cat, gen(midpoint_age_cat)
You can also enter in data from a frequency table. For example, there is a table in Blalock (1979) that shows the frequency of cases for different income ranges:
Income Range    Frequency
-------------------------
1950-2950           17
2950-3950           26
3950-4950           38
4950-5950           51
5950-6950           36
6950-7950           21
-------------------------
Total              189
You can input this table into Stata as a categorical variable and frequencies:
. clear
. input y f
  1. 1 17
  2. 2 26
  3. 3 38
  4. 4 51
  5. 5 36
  6. 6 21
  7. end
You can then label the categories,
. label def money 1 "1950-2950" 2 "2950-3950" 3 "3950-4950" 4 "4950-5950" 5 "5950-6950" 6 "6950-7950"
. label val y money
Then use frequency weights to obtain the value table and the median,
. gsum y [fweight = f], table quantiles(0.50)
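To see where an interpolated median for this table could come from, the sketch below reproduces the calculation by hand with built-in commands. It assumes the standard grouped-data formula, lower + ((q*N - F) / f) * width, where F is the cumulative frequency below the class containing the quantile; that gsum uses exactly this formula is an assumption. The variable and scalar names (lower, upper, cumf, below, q50, tot, pos) are illustrative only.

```stata
* Sketch only: hand calculation of the interpolated 0.50 quantile.
clear
input lower upper f
1950 2950 17
2950 3950 26
3950 4950 38
4950 5950 51
5950 6950 36
6950 7950 21
end
* cumulative frequency up to and including each class
generate cumf = sum(f)
* cumulative frequency below each class
generate below = cumf - f
* total number of cases and the position of the 0.50 quantile
scalar tot = cumf[_N]
scalar pos = 0.50 * scalar(tot)
* interpolate only within the class that contains the quantile position
generate q50 = lower + (scalar(pos) - below) / f * (upper - lower) if below < scalar(pos) & cumf >= scalar(pos)
list lower upper f cumf q50 if !missing(q50)
```

For this table the position 94.5 falls in the 4950-5950 class, so the formula gives 4950 + ((94.5 - 81) / 51) * 1000, or about 5214.71. The midpoint of that class is 5450.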
Saved results
gsum saves the following in r():
Scalars
    r(N)        the number of observations
    r(sum_W)    the sum of the weights
    r(mean)     the mean
    r(var)      the variance
    r(sd)       the standard deviation
    r(mn)       the minimum
    r(mx)       the maximum
    r(qiq)      the q quantile using the interpolation method
    r(qmq)      the q quantile using the midpoint method
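A minimal sketch of how these results might be inspected after a run, assuming gsum is installed and the labeled y and f variables from the Examples section are in memory. Because the exact names of the quantile scalars depend on the quantiles requested, the sketch uses the built-in return list command to display them and does not guess at their names.

```stata
* Sketch only: run gsum, then inspect what it left behind in r().
gsum y [fweight = f], quantiles(0.50)
* show every stored result together with its exact name
return list
* use individual scalars in later expressions
display "mean = " r(mean) "   sd = " r(sd)
display "range = " r(mn) " to " r(mx)
```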
Acknowledgments
The algorithms used in this program are based on Blalock, H. M. 1979. Social Statistics. 2nd ed. New York: McGraw-Hill.
Contact
This program was written by Eric Hedberg, National Opinion Research Center at the University of Chicago. Any questions or comments can be directed to ech@uchicago.edu.