help gsum
-------------------------------------------------------------------------------


gsum -- Summary statistics for grouped data

Syntax

    gsum varlist [if] [in] [weight], [options] [group definitions]


Specifying Group Ranges

gsum accepts variables with codes from 0 to 25 (integers only).

Group definitions take the form g0(#-#), g1(#-#), ..., g#(#-#), where g# identifies the group number and (#-#) gives the numeric range that the group covers.

If, however, each category of varlist has a value label in the format #-#, gsum can read the ranges directly from the labels.

If you do not specify group definitions, gsum will look for labels.

If you do specify group definitions, gsum will ignore the labels.

options               Description
-------------------------------------------------------------------------
quantiles(q q ...)    the set of quantiles to be calculated; the default
                      set is 0.25, 0.50, and 0.75.
gen(newvarlist)       create a new variable called newvarlist containing
                      the midpoints.
table                 display the value table.
save(filename)        save the value table to filename.
-------------------------------------------------------------------------



gsum calculates summary statistics for an ordinal variable in which each category represents a range of a conceptually continuous variable. gsum reports the weighted N, the mean, the standard deviation, and the quantiles 0.25, 0.50 (the median), and 0.75; you can request any other set of quantiles with the quantiles() option. Each quantile is reported in two forms: as the midpoint of the category in which the quantile falls, and as a linear interpolation within that category based on methods presented by Blalock (1979).

gsum can also produce a value table (which can also be saved) listing each category, the range, the midpoint of that range, the number of cases, the weight of each case, and the cumulative distribution function (CDF).

As an additional tool, gsum can create a new variable that contains the midpoint of each category.

gsum accepts any type of [weight] and is byable.

For example, you may have a variable age_cat where 1 represents 18-24 years of age, 2 represents 25-44 years of age, and 3 represents 45-100 years of age. You can use gsum to calculate summary statistics such as the mean, median, and standard deviation.


Use the 2010 GSS data on age

. use gssage.dta, clear

If the variable age_cat is labeled correctly,

. gsum age_cat

Or, if you are not sure that the labels are in the #-# format, specify the ranges directly,

. gsum age_cat, g1(18-24) g2(25-44) g3(45-100)

To use weights,

. gsum age_cat [pweight = wtssall]

To see the value table,

. gsum age_cat, table

To save the value table in the file valuetable.dta,

. gsum age_cat, save(valuetable.dta)

To create the variable midpoint_age_cat,

. gsum age_cat, gen(midpoint_age_cat)
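Because gsum is byable, the same statistics can also be produced separately for subgroups. The lines below are a minimal sketch, assuming that gssage.dta contains a grouping variable named sex; that name is hypothetical and is not taken from the examples above.

```stata
* sex is an assumed grouping variable; substitute one that exists in your data
bysort sex: gsum age_cat [pweight = wtssall]
```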

You can also enter data from a frequency table. For example, a table in Blalock (1979) shows the frequency of cases for different income ranges:

Income Range   Frequency
-----------------------
1950-2950          17
2950-3950          26
3950-4950          38
4950-5950          51
5950-6950          36
6950-7950          21
-----------------------
Total             189

You can input this table into Stata as a categorical variable and frequencies:

. clear
. input y f
  1. 1 17
  2. 2 26
  3. 3 38
  4. 4 51
  5. 5 36
  6. 6 21
  7. end

You can then label the categories

. label def money 1 "1950-2950" 2 "2950-3950" 3 "3950-4950" 4 "4950-5950" 5 "5950-6950" 6 "6950-7950"
. label val y money

Then use frequency weights

. gsum y [fweight = f], table quantiles(0.50)
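To see what the midpoint and interpolation calculations involve, the income table above can be worked through by hand. The sketch below uses only built-in Stata commands and the standard grouped-data formula, median = L + ((N/2 - F)/f) * width, where L is the lower limit of the category containing the median, F is the cumulative frequency below that category, and f is the frequency within it. It is an illustration under that assumption, not a listing of gsum's own code, and the helper variable names (lower, upper, mid, cum, med) are invented for the example.

```stata
* Hand check for the income table; not gsum's internal code
clear
input y f
1 17
2 26
3 38
4 51
5 36
6 21
end

* Category limits and midpoints (lower limits are 1950, 2950, ..., 6950)
generate lower = 950 + 1000*y
generate upper = lower + 1000
generate mid   = (lower + upper)/2

* Grouped mean: frequency-weighted mean of the midpoints (about 5116.67)
summarize mid [fweight = f], meanonly
display "grouped mean = " %9.2f r(mean)

* Interpolated median: L + ((N/2 - F)/f) * width (about 5214.71)
generate cum = sum(f)
local half = cum[_N]/2
generate med = lower + (`half' - (cum - f))/f * (upper - lower) ///
    if cum >= `half' & (cum - f) < `half'
summarize med, meanonly
display "interpolated median = " %9.2f r(mean)
```

The median falls in the 4950-5950 category, whose midpoint, 5450, is the value a midpoint-based median would take for this table.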

Saved results

gsum saves the following in r():

Scalars
    r(N)       the number of observations
    r(sum_W)   the sum of the weights
    r(mean)    the mean
    r(var)     the variance
    r(sd)      the standard deviation
    r(mn)      the minimum
    r(mx)      the maximum
    r(qiq)     the q quantile using the interpolation method
    r(qmq)     the q quantile using the midpoint method
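The saved results can be inspected with Stata's return list command and used in later expressions. A short sketch, assuming the frequency-table variables y and f from the example above are still in memory:

```stata
* Run gsum, then list and reuse what it leaves in r()
gsum y [fweight = f], quantiles(0.50)
return list
display "mean = " r(mean) "   sd = " r(sd)
```

return list is also the easiest way to confirm the exact names under which the quantile results are stored for the quantiles you requested.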


The algorithms used in this program are based on

Blalock, H. M. 1979. Social Statistics. 2nd ed. New York: McGraw-Hill.


This program was written by Eric Hedberg, National Opinion Research Center at the University of Chicago. Any questions or comments can be directed to