help gdsum
-------------------------------------------------------------------------------

Title

     gdsum -- Summarize grouped data

Syntax

        gdsum varlist [if] [in] [, options]

    options                Description
    -------------------------------------------------------------------------
      remove("chars")      remove chars from value labels
      median               additionally calculate median
      min(#)               set lower boundary of first class to #
      max(#)               set upper boundary of last class to #
      comma                treat commas in value labels as decimal points
      matrix(matname)      return output in matrix r(matname)
      novarlabel           use variable names in output matrix
      format(%fmt)         set format for output
    -------------------------------------------------------------------------
    by is allowed


Description

    gdsum calculates the mean and standard deviation for grouped data; the
    median is calculated if median is specified. Results are displayed in a
    matrix and returned in r(). The statistics are computed from the lower
    and upper boundaries of the (disjoint) classes, which gdsum reads from
    the value label defined for each class.

    Each value label must contain the lower and upper boundary of its class,
    in that order (but see min() and max() for open-ended classes), and may
    contain other characters. Any other characters that appear before the
    upper boundary must be listed in the remove() option; characters that
    follow the upper boundary are ignored.
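
    The parsing rule above can be illustrated with a minimal Python sketch.
    It is not gdsum's own code: the function name parse_bounds, the
    substring-based removal, and the whitespace tokenizing are all
    assumptions made for illustration.

```python
def parse_bounds(label, remove=(), comma=False):
    """Return (lower, upper) read from a value label (hypothetical helper)."""
    # "-" is always removed, mirroring the documented default of remove().
    for text in ("-",) + tuple(remove):
        label = label.replace(text, " ")
    # Commas are either decimal points (comma=True) or dropped entirely.
    label = label.replace(",", "." if comma else "")
    tokens = label.split()
    # The first two tokens must be numbers; float() raises ValueError if
    # characters before the upper boundary were not removed.
    return float(tokens[0]), float(tokens[1])

print(parse_bounds("$ 500 to 999.99", remove=("$", "to")))  # (500.0, 999.99)
print(parse_bounds("0-499,99 Euro"))                        # (0.0, 49999.0)
print(parse_bounds("0-499,99 Euro", comma=True))            # (0.0, 499.99)
```

    In this sketch, trailing text such as "Euro" is harmless because only
    the first two tokens are read, whereas an unremoved "$" makes the
    conversion fail.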

Options

        +---------+
    ----+ options +----------------------------------------------------------

    remove("chars") specifies the characters to be removed from value labels.
        "-" is always removed, in addition to the characters specified.
        Commas in value labels are ignored unless comma is specified. Double
        quotes may be omitted.

    median additionally calculates and returns the median.

    min(#) sets the lower boundary of the first class to #. This option may
        be used if the lower bound of the first class is not given in the
        value label. In this case min(0) is the default.

    max(#) sets the upper boundary of the last class to #. This option may be
        used if the upper bound of the last class is not given in the value
        label. In this case max() is set to the lower bound of the last
        class.

    comma treats commas in value labels as decimal points. The default is to
        ignore commas. This option temporarily applies set dp comma, so
        results are also displayed with a decimal comma.

    matrix(matname) returns the result matrix in matname.

    novarlabel uses variable names in the output. The default is to use
        variable labels, abbreviated to 32 characters.

    format(%fmt) sets the display format.


Examples

    Example 1

        . tabulate inc

         Income per month |      Freq.     Percent        Cum.
        ------------------+-----------------------------------
            $ 0 to 499.99 |          4       26.67       26.67
          $ 500 to 999.99 |          6       40.00       66.67
        $ 1000 to 2499.99 |          3       20.00       86.67
           $ 2500 to 5000 |          2       13.33      100.00
        ------------------+-----------------------------------
                    Total |         15      100.00

        . gdsum inc ,remove($ to)

                         |      Mean         SD        Obs 
        -----------------+---------------------------------
        Income per month |  1216.662   1156.762         15 


    Example 2

        . tabulate inc

                Einkommen |      Freq.     Percent        Cum.
        ------------------+-----------------------------------
            0-499,99 Euro |          4       26.67       26.67
          500-999,99 Euro |          6       40.00       66.67
        1000-2499,99 Euro |          3       20.00       86.67
           2500-5000 Euro |          2       13.33      100.00
        ------------------+-----------------------------------
                    Total |         15      100.00

        . gdsum inc ,novarlabel

                     |      Mean         SD        Obs 
        -------------+---------------------------------
                 inc |  52366.23   41226.84         15 


    Note that "-" and "Euro" do not have to be removed, because "-" is
    removed by default and "Euro" follows the upper boundary. The result
    differs from that of Example 1 because the labels use a comma as the
    decimal point. By default commas are removed, so "499,99" becomes
    "49999". To get the correct result, specify comma.

        . gdsum inc ,novarlabel comma

                     |      Mean         SD        Obs 
        -------------+---------------------------------
                 inc |  1216,662   1156,762         15 
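
    The size of the difference can be checked by hand. The Python sketch
    below is illustrative only and is not part of gdsum. It hard-codes the
    boundaries as they would be read with and without comma, then applies
    the midpoint mean from the Formulas section; grouped_mean is a
    hypothetical helper name.

```python
# Boundaries read from the labels when commas are dropped (the default) ...
bounds_default = [(0, 49999), (500, 99999), (1000, 249999), (2500, 5000)]
# ... and when commas are treated as decimal points (option comma).
bounds_comma = [(0, 499.99), (500, 999.99), (1000, 2499.99), (2500, 5000)]
freqs = [4, 6, 3, 2]

def grouped_mean(bounds, freqs):
    """Mean of grouped data using class midpoints (hypothetical helper)."""
    mids = [(lb + ub) / 2 for lb, ub in bounds]
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)

print(round(grouped_mean(bounds_default, freqs), 2))  # 52366.23
print(round(grouped_mean(bounds_comma, freqs), 3))    # 1216.662
```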


    Example 3

        . tabulate inc

            Income p. month |      Freq.     Percent        Cum.
        --------------------+-----------------------------------
           up to USD 499.99 |          4       26.67       26.67
          USD 500 to 999.99 |          6       40.00       66.67
        USD 1000 to 2499.99 |          3       20.00       86.67
         more than USD 2500 |          2       13.33      100.00
        --------------------+-----------------------------------
                      Total |         15      100.00

        . gdsum inc ,remove(up to USD "more than") median
        inc: lower boundary set to 0
        inc: upper boundary set to 2500

                       |      Mean         SD        p50        Obs 
        ---------------+--------------------------------------------
        Income p month |  1049.996   791.6993   791.6608         15 


    Note that the lower boundary is missing in the first class and the upper
    boundary is missing in the last class. Since min() and max() are not
    specified, gdsum uses 0 as the lower boundary of the first class and
    2500 as the upper boundary of the last class. To reproduce the result
    of Example 1, specify max(5000).

        . gdsum inc ,remove(up to USD "more than") median max(5000)
        inc: lower boundary set to 0
        inc: upper boundary set to 5000

                       |      Mean         SD        p50        Obs 
        ---------------+--------------------------------------------
        Income p month |  1216.662   1156.762   791.6608         15 
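
    The documented defaults for open-ended classes can be sketched as
    follows. This is an illustration in Python, not gdsum's implementation.
    The helper names and the use of None for a missing boundary are
    assumptions.

```python
def fill_open_classes(bounds, minimum=0, maximum=None):
    """Fill a missing first lower or last upper boundary (hypothetical)."""
    filled = list(bounds)
    first_lb, first_ub = filled[0]
    if first_lb is None:
        filled[0] = (minimum, first_ub)          # min(0) is the default
    last_lb, last_ub = filled[-1]
    if last_ub is None:
        # Default max(): the lower boundary of the last class.
        filled[-1] = (last_lb, last_lb if maximum is None else maximum)
    return filled

def grouped_mean(bounds, freqs):
    mids = [(lb + ub) / 2 for lb, ub in bounds]
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)

labels = [(None, 499.99), (500, 999.99), (1000, 2499.99), (2500, None)]
freqs = [4, 6, 3, 2]

print(round(grouped_mean(fill_open_classes(labels), freqs), 3))  # 1049.996
print(round(grouped_mean(fill_open_classes(labels, maximum=5000), freqs), 3))  # 1216.662
```

    With the default, the last class collapses to the single point 2500,
    which pulls the mean down relative to max(5000).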


Saved results

    gdsum saves the following in r():

    Scalars
      r(mean_varname)      mean
      r(sd_varname)        standard deviation
      r(N_varname)         number of non-missing observations
      r(p50_varname)       median (median only)

    Matrices
      r(matname)           result matrix (matrix() only)


Formulas

    The mean is calculated as

        (1) M = (A' * F) / n

    with
        A = (el1 el2 ... elk)'
       el = ((lb + ub)/2)
        F = (freq1 freq2 ... freqk)'

    A and F are k x 1 vectors, where k is the number of classes. In el, lb
    and ub are the lower and upper boundaries of the class intervals. In F,
    freq is the class frequency. The number of non-missing observations is n.

    The standard deviation is calculated as

        (2) sd = sqrt((1/(n-1)) * (B' * F))

    with
        B = (el1 el2 ... elk)'
       el = ((xm - M)^2)

    In el, xm is the midpoint of the class interval (i.e., the corresponding
    element of A) and M is the mean as calculated in (1).
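
    As a check, formulas (1) and (2) can be applied to the data of
    Example 1. The Python sketch below is illustrative and is not gdsum's
    own code. It uses the sample divisor n - 1, which reproduces the
    standard deviation shown in that example.

```python
import math

lower = [0, 500, 1000, 2500]           # lb of each class
upper = [499.99, 999.99, 2499.99, 5000]  # ub of each class
F = [4, 6, 3, 2]                       # class frequencies

n = sum(F)
A = [(lb + ub) / 2 for lb, ub in zip(lower, upper)]   # class midpoints
M = sum(a * f for a, f in zip(A, F)) / n              # (1) mean
B = [(xm - M) ** 2 for xm in A]                       # squared deviations
sd = math.sqrt(sum(b * f for b, f in zip(B, F)) / (n - 1))  # (2)

print(round(M, 3), round(sd, 3))  # 1216.662 1156.762
```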

    The median is calculated as

        (3) p50 = lbm + ((n/2 - cf) / nmcl) * (ubm-lbm)

    where lbm and ubm are the lower and upper boundaries of the median
    class, cf is the cumulative frequency of all classes before the median
    class, and nmcl is the number of observations in the median class.
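
    Formula (3) can be sketched in Python as shown below. This is an
    illustration, not gdsum's implementation. In particular, it assumes the
    median class is the first class whose cumulative frequency reaches n/2;
    how gdsum handles ties is not documented here.

```python
def grouped_median(lower, upper, freqs):
    """Interpolated median of grouped data (hypothetical helper)."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    cf = 0  # cumulative frequency of the classes before the current one
    for lbm, ubm, nmcl in zip(lower, upper, freqs):
        if cf + nmcl >= n / 2:   # assumed rule for finding the median class
            return lbm + ((n / 2 - cf) / nmcl) * (ubm - lbm)
        cf += nmcl

lower = [0, 500, 1000, 2500]
upper = [499.99, 999.99, 2499.99, 5000]
print(round(grouped_median(lower, upper, [4, 6, 3, 2]), 4))  # 791.6608
```

    With n = 15, the median class is the second one (cf = 4, nmcl = 6), so
    p50 = 500 + (3.5/6) * 499.99, matching the p50 shown in Example 3.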

Author

    Daniel Klein, University of Bamberg, daniel1.klein@gmx.de

Also see

    Online: summarize