help gsum
-------------------------------------------------------------------------------

Title

    gsum -- Summary statistics for grouped data

Syntax

    gsum varlist [if] [in] [weight], [options] [group definitions]

-----------------------------------------------------------------------------

Specifying Group Ranges

gsum accepts variables with integer codes from 0 to 25. Elements of group definitions can be g0(#-#), g1(#-#), ..., g#(#-#), where g# identifies the group number and (#-#) identifies a numeric range. If, however, varlist has each category labeled in the format #-#, gsum can simply use these values. If you do not specify group definitions, gsum will look for labels. If you do specify group definitions, gsum will ignore the labels.

    options               Description
    -------------------------------------------------------------------------
    quantiles(q q ...)    the set of quantiles to be calculated; the default
                          set is 0.25, 0.50, and 0.75
    gen(newvarlist)       create a new variable called newvarlist containing
                          the midpoints
    table                 display the value table
    save(filename)        save the value table to filename
    -------------------------------------------------------------------------

Description

gsum calculates summary statistics for an ordinal variable where each category represents a range of a conceptually continuous variable. gsum provides the weighted N, the mean, the standard deviation, and, by default, the 0.25, 0.50 (median), and 0.75 quantiles; you can specify any other set of quantiles with quantiles(). Each quantile is reported in two forms: as the midpoint of the category in which the quantile falls, and as a linear interpolation of that quantile based on methods presented by Blalock (1979).

gsum can also produce a value table, which can be saved, listing each category, the range, the midpoint of that range, the number of cases, the weight of each case, and the cumulative distribution function (CDF). In addition, gsum can create a new variable that contains the midpoints.

gsum accepts any type of [weight] and is byable.

For example, you may have a variable age_cat where 1 represents 18-24 years of age, 2 represents 25-44 years of age, and 3 represents 45-100 years of age. You can use gsum to calculate summary statistics such as the mean, median, and standard deviation.

Examples

Use the 2010 GSS data on age

    . use gssage.dta, clear

If the variable age_cat is labeled correctly,

    . gsum age_cat

Or, if you are not sure,

    . gsum age_cat, g1(18-24) g2(25-44) g3(45-100)

To use weights,

    . gsum age_cat [pweight = wtssall]

To see the value table,

    . gsum age_cat, table

To save the value table in the file valuetable.dta,

    . gsum age_cat, save(valuetable.dta)

To create the variable midpoint_age_cat,

    . gsum age_cat, gen(midpoint_age_cat)

You can also enter data from a frequency table. For example, a table in Blalock (1979) shows the frequency of cases for different income ranges:

    Income Range   Frequency
    ------------------------
    1950-2950          17
    2950-3950          26
    3950-4950          38
    4950-5950          51
    5950-6950          36
    6950-7950          21
    ------------------------
    Total             189

You can input this table into Stata as a categorical variable and frequencies:

    . clear
    . input y f
      1 17
      2 26
      3 38
      4 51
      5 36
      6 21
      end

You can then label the categories

    . label def money 1 "1950-2950" 2 "2950-3950" 3 "3950-4950" 4 "4950-5950" 5 "5950-6950" 6 "6950-7950"
    . label val y money

Then use frequency weights

    . gsum y [fweight = f], table quantiles(0.50)
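
As a hand check on the interpolated median, the sketch below applies the standard grouped-data interpolation formula to the Blalock (1979) income table using only built-in Stata commands. It is an illustration under assumptions, not gsum's own code: the variable names lower, upper, mid, cumf, below, and med are hypothetical, and gsum's results may differ in detail (for example, in how interval boundaries are treated).

```stata
* Sketch only: grouped-data statistics by hand, not gsum's implementation.
clear
input lower upper f
1950 2950 17
2950 3950 26
3950 4950 38
4950 5950 51
5950 6950 36
6950 7950 21
end

* Midpoint of each range; weighted mean and SD of the midpoints
generate mid = (lower + upper) / 2
summarize mid [fweight = f]

* Cumulative frequency and number of cases below each interval
generate cumf  = sum(f)
generate below = cumf - f
scalar half = cumf[_N] / 2          // N/2 = 189/2 = 94.5

* Linear interpolation inside the interval that contains case N/2:
* median = lower + (N/2 - cases below) / f * interval width
generate double med = lower + (scalar(half) - below) / f * (upper - lower) ///
    if below < scalar(half) & cumf >= scalar(half)
summarize med, meanonly
display "interpolated median = " r(mean)
```

Here the median falls in the 4950-5950 interval (81 cases lie below it and it holds 51 cases), so the interpolation gives 4950 + (94.5 - 81)/51 * 1000, or about 5214.7. The midpoint method would instead report 5450, the midpoint of that interval.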

Saved results

gsum saves the following in r():

Scalars

    r(N)        the number of observations
    r(sum_W)    the sum of the weights
    r(mean)     the mean
    r(var)      the variance
    r(sd)       the standard deviation
    r(mn)       the minimum
    r(mx)       the maximum
    r(qiq)      the q quantile using the interpolation method
    r(qmq)      the q quantile using the midpoint method
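
A short sketch of how the saved results might be inspected after a run. It assumes gsum is installed and that the y and f variables from the frequency-table example are in memory; return list is Stata's built-in command for listing everything currently stored in r(), which is the safest way to see the exact names of the quantile scalars.

```stata
* Assumes y and f from the frequency-table example are loaded.
gsum y [fweight = f], quantiles(0.50)

* List every result gsum left behind in r()
return list

* Use individual scalars documented above in later calculations
display "mean = " r(mean) "   sd = " r(sd)
```

Results in r() are overwritten by the next r-class command, so copy any values you need into scalars or locals before running something else.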

Acknowledgments

The algorithms used in this program are based on Blalock, H. M. 1979. Social Statistics. 2nd ed. New York: McGraw-Hill.

Contact

This program was written by Eric Hedberg, National Opinion Research Center at the University of Chicago. Any questions or comments can be directed to ech@uchicago.edu.