-------------------------------------------------------------------------------
help for gbgfit                                      Austin Nichols (June 2009)
-------------------------------------------------------------------------------

Program to fit a Generalized Beta (Type 2) distribution to grouped data via ML

        gbgfit nvar [if exp] [in range] [, z1(z1var) z2(z2var) from(string)
                 avar(var) bvar(var) pvar(var) qvar(var) sva(var) svb(var)
                 svp(var) svq(var) replace double level(#) maximize_options ]

    by ... : may be used with gbgfit; see help by.

Description

    gbgfit fits by ML the 4-parameter Generalized Beta (Type 2), or GB2,
    distribution to a sample of counts or frequencies in nvar by income
    category.  (For an estimator designed for unit record data, see gb2fit.)
    Optionally specified variables z1var and z2var encode the lower and upper
    limits of each interval; if they are not specified, variables serving
    that role and called z1 and z2 are assumed to exist.

    The GB2 distribution seems to provide a good fit to empirical income data
    relative to other parametric functional forms; see e.g. McDonald (1984).
    It includes the 3-parameter Singh-Maddala (Burr Type 12) and Dagum (Burr
    Type 3) distributions as special cases; see Singh and Maddala (1976) and
    Dagum (1977,1980).  The Singh-Maddala distribution is the special case
    when parameter p = 1, and the Dagum distribution is the special case when
    parameter q = 1.  See also help for dagfit or smgfit if installed (or see
    dagumfit and smfit for estimators designed for unit record data).  The
    GB2 distribution is also known as the Generalized F distribution
    (differently parameterized) or the Feller-Pareto distribution (with a
    nonzero minimum).  For a comprehensive review of these and other related
    distributions, see Kleiber and Kotz (2003).  The GB2 distribution may be
    useful for describing any skewed positive variable.

    To test whether a given distribution assumed to be well-modeled as a GB2
    distribution can also be usefully modeled as a Singh-Maddala or Dagum
    distribution, one might want to test the p and q parameters for equality
    to one, as shown in the example.  Since q = 1 approximately (and we
    cannot reject the null that q = 1) in the estimates shown, it seems
    reasonable to model the distribution as Dagum.  Looking at graphs, we can
    see how closely the Dagum and GB2 estimates of the p.d.f. correspond in
    the example.
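
    If both restrictions are of interest at once, a joint Wald test can be
    sketched as follows.  This assumes gbgfit has just been run and that the
    equations are named p and q, as in the example.  Failing to reject both
    restrictions would point toward the Fisk (log-logistic) distribution, the
    GB2 special case with p = q = 1.

```stata
* Joint Wald test that p = 1 and q = 1 (run immediately after gbgfit)
test ([p]_b[_cons]=1) ([q]_b[_cons]=1)
```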


Example from McDonald (1984)

clear all
input z1 f70 f75 f80
    0    66  35  21
    2.5  125 85  41
    5    152 106 62
    7.5  166 106 65
    10   158 114 73
    12.5 110 109 69
    15   131 188 140
    20   46  116 137
    25   30  95  198
    35   11  32  128
    50   05  14  67
end
g z2=z1[_n+1]
g mid=(z1+z2)/2
replace mid=100 if mid==.  
set obs 200
g x=_n/2
la var x "Income"
gbgfit f70, difficult
test [p]_b[_cons]=1
test [q]_b[_cons]=1
g g=(e(ba)*x^(e(ba)*e(bp)-1))/e(bb)^(e(ba)*e(bp))/(1+(x/e(bb))^e(ba))^(e(bp)+e(bq)) ///
    *exp(lngamma(e(bp)+e(bq))-lngamma(e(bp))-lngamma(e(bq)))
la var g "PDF for grouped GB2 MLE 1970"
gb2fit mid [fw=f70], from(4 20 .3 .6, copy) difficult
g g2=(e(ba)*x^(e(ba)*e(bp)-1))/e(bb)^(e(ba)*e(bp))/(1+(x/e(bb))^e(ba))^(e(bp)+e(bq)) ///
    *exp(lngamma(e(bp)+e(bq))-lngamma(e(bp))-lngamma(e(bq)))
la var g2 "PDF for indiv. GB2 MLE applied to group data from 1970"
dagfit f70 
g dg=e(ba)*e(bp)*(e(bb)/x)^e(ba)/x*(1+(e(bb)/x)^e(ba))^(-e(bp)-1)
la var dg "PDF for grouped Dagum MLE 1970"
smgfit f70
g sg=(e(ba)*e(bq)/e(bb))*((1+(x/e(bb))^e(ba))^-(e(bq)+1))*((x/e(bb))^(e(ba)-1))
la var sg "PDF for grouped Singh-Maddala MLE 1970"
tw hist mid [fw=f70]||line g x, name(hist)
line g x||line g2 x||line dg x||line sg x, leg(col(1))
g dd=g-dg
g ds=g-sg
la var dd "Diff. betw. GB2 and Dagum PDF"
la var ds "Diff. betw. GB2 and S-M PDF"
line dd ds x, leg(col(1)) name(dpdf)

Options

    z1(z1var), z2(z2var) are the lower and upper limits of each interval; if
        they are not specified, variables serving that role and called z1 and
        z2 are assumed to exist. It should always be true that the upper
        bound of one category is the same as the lower bound of the next
        highest category.
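
    A quick check of the interval bounds before estimation might look like
    the sketch below.  It assumes the variables are named z1 and z2, that
    rows are already ordered by z1, and that a missing z2 marks an open-ended
    top category (as in the example).

```stata
* Every closed interval must have positive width
assert z1 < z2 if !missing(z1, z2)
* The upper bound of each category must equal the lower bound of the next
assert z2 == z1[_n+1] if !missing(z2, z1[_n+1])
```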

    ?var(var) options specify varlists that are assumed to have a linear
        effect on the parameter specified.  By default, each varlist contains
        only a constant, so only the parameter itself is estimated.

    sv?(var) options specify newvars in which to store the estimated
        parameters.  If ?var(var) options have been specified, the prediction
        assumes all explanatory variables are zero, i.e. the prediction is
        only for the constant.

    replace allows sv?(var) options to specify existing variables, whose
        values are replaced in the estimation sample.

    double requests that sv?(var) create doubles.
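
    As an illustration of the sv?(var), replace, and double options
    together, the call below stores the four estimated parameters as
    double-precision variables.  The names ahat, bhat, phat, and qhat are
    hypothetical; f70 is the count variable from the example.

```stata
* Store estimated parameters in new double-precision variables
gbgfit f70, sva(ahat) svb(bhat) svp(phat) svq(qhat) double difficult
summarize ahat bhat phat qhat

* Refit another year, overwriting the same variables in the estimation sample
gbgfit f75, sva(ahat) svb(bhat) svp(phat) svq(qhat) replace double difficult
```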

    from(string) specifies initial values for the parameters.  Trying
        different starting values is useful both to assess how much the
        estimates depend on them (the generalized beta is more sensitive to
        initial parameter vectors than the Dagum or Singh-Maddala) and to
        speed convergence.  You can specify the initial values in one of
        three ways: by naming a vector containing the initial values (e.g.,
        from(b0) where b0 is a properly labeled vector); by specifying
        coefficient names with the values (e.g., from(a:_cons=1 b:_cons=5
        p:_cons=1 q:_cons=1)); or by specifying an ordered list of values
        (e.g., from(1 5 1 1, copy)).  Poor values in from() may lead to
        convergence problems.  For more details, including the use of copy
        and skip, see help maximize.
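
    A "properly labeled" vector can be built as in the sketch below.  The
    equation names a, b, p, and q follow the coefficient names used in the
    from() example above; the starting values are purely illustrative
    (borrowed from the gb2fit call in the example) and are not recommended
    defaults.

```stata
* Row vector of starting values, in the order a, b, p, q
matrix b0 = (4, 20, .3, .6)
* Label each column as equation:coefficient so ml can match by name
matrix colnames b0 = a:_cons b:_cons p:_cons q:_cons
matrix list b0
gbgfit f70, from(b0) difficult
```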

    level(#) specifies the confidence level, in percent, for the confidence
        intervals of the coefficients; see help level.

    nolog suppresses the iteration log.

    maximize_options control the maximization process. The options available
        are those shown by maximize, with the exception of from().  If you
        are seeing many "(not concave)" messages in the iteration log, using
        the difficult or technique options may help convergence.

Saved results

    In addition to the usual results saved after ml, gbgfit also saves the
    following:

    e(ba), e(bb), e(bp), and e(bq) are the estimated GB2 parameters a, b, p,
    and q.  If covariates are specified, these are the parameters when all
    covariates are zero; i.e., these are the constant terms in each equation.
    These parameters are used to calculate e(mode), e(mean), e(var), e(sd),
    e(i2), and e(gini), which are the estimated mode, mean, variance,
    standard deviation, half coefficient of variation squared, and Gini
    coefficient, respectively.  e(pX) and e(LpX) are the quantiles and Lorenz
    ordinates, where X ranges from 1 to 99.

Formulae

    The GB2 distribution has distribution function (c.d.f.)

        F(x) = ibeta(p, q, (x/b)^a/(1+(x/b)^a) )

    where a, b, p, q, are parameters, each positive, for random variable x >
    0.  Parameters a, p, and q are the key distributional 'shape' parameters;
    b is a scale parameter.
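
    The c.d.f. can be evaluated directly with Stata's ibeta() function.  The
    sketch below uses hypothetical parameter values held in local macros, so
    it should be run from a do-file or in a single interactive session.

```stata
* Hypothetical parameter values, chosen only for illustration
local a 4
local b 20
local p 0.3
local q 0.6
local x 15

* Map x to the unit interval, then apply the incomplete beta function
local u = (`x'/`b')^`a' / (1 + (`x'/`b')^`a')
display "F(`x') = " ibeta(`p', `q', `u')

* Sanity check: with p = q = 1 the c.d.f. reduces to u, so F(b) = 0.5
assert reldif(ibeta(1, 1, 0.5), 0.5) < 1e-12
```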

    The GB2 distribution has density

        f(x) = ax^(ap-1)*{(b^(a*p))*B(p,q)*[1 + (x/b)^a ]^(p+q)}^-1.

    The likelihood function for a sample of observations on nvar is specified
    as the product, over categories, of the density integrated from z1 to z2
    (that is, F(z2) - F(z1)) raised to the power nvar, the count of
    observations in the category.  It is maximized using ml model lf.
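
    To make the grouped-data likelihood concrete, the sketch below evaluates
    the log likelihood at one hypothetical parameter vector, using the 1970
    counts from the example.  Treating a missing z2 as an open-ended top
    category with F(z2) = 1 is an assumption of this sketch, not a statement
    about gbgfit's internals.

```stata
* Caution: clear drops any data currently in memory
clear
input z1 z2 n
0    2.5  66
2.5  5    125
5    7.5  152
7.5  10   166
10   12.5 158
12.5 15   110
15   20   131
20   25   46
25   35   30
35   50   11
50   .    5
end

* Hypothetical parameter values, chosen only for illustration
local a 4
local b 20
local p 0.3
local q 0.6

* c.d.f. at each bound; the open top interval is assumed to end at F = 1
generate double F1 = ibeta(`p', `q', (z1/`b')^`a' / (1 + (z1/`b')^`a'))
generate double F2 = cond(missing(z2), 1, ///
    ibeta(`p', `q', (z2/`b')^`a' / (1 + (z2/`b')^`a')))

* Each category contributes n * ln(F(z2) - F(z1)) to the log likelihood
generate double ll = n * ln(F2 - F1)
summarize ll, meanonly
display "log likelihood = " r(sum)
```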

    The formulae used to derive the distributional summary statistics
    presented (optionally) are as follows. The r-th moment about the origin
    is given by

        b^r*B(p+r/a,q-r/a)/B(p,q)

    where B(u,v) is the Beta function = G(u)*G(v)/G(u+v) and
    G(.)=exp(lngamma(.)) is the gamma function. The moments exist for -ap < r
    < aq. By substitution, the G(p+q) terms cancel and the moments can be
    written

        b^r*G(p+r/a)*G(q-r/a)/[G(p)G(q)]

    and hence

        mean = b*G(p+1/a)*G(q-1/a)/[G(p)G(q)]

        variance = b*b*G(p+2/a)*G(q-2/a)/[G(p)G(q)] - (mean^2)

    from which the standard deviation and half the squared coefficient of
    variation can be derived. The mode is

        mode = b*((ap-1)/(aq+1))^(1/a) if ap > 1, and 0 otherwise.
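
    These formulae translate directly into Stata expressions built on
    lngamma().  The sketch below uses hypothetical parameter values for
    which the first two moments exist (2 < aq); it is an illustration of the
    formulae above rather than a copy of gbgfit's own code.

```stata
* Hypothetical parameter values, chosen only for illustration
local a 4
local b 20
local p 0.3
local q 0.6

* ln[G(p)G(q)], the common denominator of the moment formula
local lnD = lngamma(`p') + lngamma(`q')

* First and second moments about the origin (valid only for -ap < r < aq)
local m1 = `b'   * exp(lngamma(`p' + 1/`a') + lngamma(`q' - 1/`a') - `lnD')
local m2 = `b'^2 * exp(lngamma(`p' + 2/`a') + lngamma(`q' - 2/`a') - `lnD')

local var = `m2' - `m1'^2
local sd  = sqrt(`var')
* Half the squared coefficient of variation
local i2  = 0.5 * `var' / `m1'^2
* Mode is interior only when ap > 1, and 0 otherwise
local mode = cond(`a'*`p' > 1, ///
    `b' * ((`a'*`p' - 1)/(`a'*`q' + 1))^(1/`a'), 0)

display "mean = " `m1' "  variance = " `var' "  sd = " `sd'
display "i2 = " `i2' "  mode = " `mode'
```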

    The quantiles are derived by inverting the distribution function, and
    calculation of the Lorenz ordinates exploits the fact that they follow a
    GB2 distribution; see Kleiber and Kotz, 2003, eqn (6.23). The Gini
    coefficient is not calculated as this requires evaluation of the
    generalized hypergeometric 3F2, and this function is not currently
    available in Stata.  Online evaluators are available, at e.g. 
    wolfram.com, where you can plug in specific parameter values to calculate
    the generalized hypergeometric 3F2, then use the formula given by
    McDonald (1984) to calculate the Gini.
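
    Inverting the distribution function for a quantile can be sketched with
    Stata's invibeta() function: solve ibeta(p, q, u) = P for u, then undo
    the transformation u = (x/b)^a/(1+(x/b)^a).  The parameter values below
    are hypothetical.

```stata
* Hypothetical parameter values, chosen only for illustration
local a 4
local b 20
local p 0.3
local q 0.6
* Cumulative probability of the desired quantile (0.5 gives the median)
local P 0.5

* Invert the incomplete beta function, then map back from u to x
local u  = invibeta(`p', `q', `P')
local xq = `b' * (`u' / (1 - `u'))^(1/`a')
display "quantile at P = `P': " `xq'

* Round trip: the c.d.f. evaluated at xq should return P
local u2 = (`xq'/`b')^`a' / (1 + (`xq'/`b')^`a')
assert reldif(ibeta(`p', `q', `u2'), `P') < 1e-6
```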


Author

    Austin Nichols <austinnichols@gmail.com>

Acknowledgements

    Stephen P. Jenkins has made available commands for fitting various
    distributions (including the GB2) to individual record data; see Jenkins
    (2004).  This package draws liberally from that work; note in particular
    the similarity of verbiage under headings Formulae and Description.

References

    Dagum, C. (1977). A new model of personal income distribution:
        specification and estimation. Economie Appliquée 30: 413-437.

    Dagum, C. (1980). The generation and distribution of income, the Lorenz
        curve and the Gini ratio. Economie Appliquée 33: 327-367.

    Jenkins, S.P. (2004). Fitting functional forms to distributions, using
        ml. Presentation at Second German Stata Users Group Meeting, Berlin. 
        http://www.stata.com/meeting/2german/Jenkins.pdf

    Kleiber, C. (1996). Dagum vs. Singh-Maddala income distributions.
        Economics Letters 53: 265-268. 
        http://dx.doi.org/10.1016/S0165-1765(96)00937-8

    Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in
        Economics and Actuarial Sciences.  Hoboken, NJ: John Wiley.

    McDonald, J.B. (1984). Some generalized functions for the size
        distribution of income. Econometrica 52: 647-663. 
        http://www.jstor.org/stable/1913469

    Singh, S.K. and Maddala, G.S. (1976). A function for the size
        distribution of income. Econometrica 44: 963-970.

Also see

    Online: help for dagfit, smgfit, dagumfit, smfit, gb2fit, lognfit, if
             installed, or acquire them from ssc.