-------------------------------------------------------------------------------
help for smgfit                                      Austin Nichols (June 2009)
-------------------------------------------------------------------------------

Program to fit a Singh-Maddala distribution to grouped data via ML

        smgfit nvar [if exp] [in range] [, z1(z1var) z2(z2var) avar(var)
                 bvar(var) qvar(var) sva(var) svb(var) svq(var) replace
                 double from(string) level(#) maximize_options ]

    by ... : may be used with smgfit; see help by.

Description

    smgfit fits by ML the three-parameter Singh-Maddala (1976) distribution
    to a sample of counts or frequencies in nvar by income category. (For an
    estimator designed for unit record data, see smfit.) Optionally specified
    variables z1var and z2var encode the lower and upper limits of each
    interval; if they are not specified, variables serving that role and
    called z1 and z2 are assumed to exist.

    Otherwise known as the Burr Type 12 distribution or as a Beta-P
    distribution (Cronin 1979 and Johnson and Kotz 1970), the Singh-Maddala
    distribution seems to provide a good fit to empirical income data
    relative to other parametric functional forms; see e.g. McDonald (1984).
    The Singh-Maddala distribution is closely related to the Dagum (Burr Type
    3) distribution of Dagum (1977, 1980); see dagfit (or see dagumfit for an
    estimator designed for unit record data). Both are special cases of the
    Generalized Beta of the Second Kind distribution (see gb2fit for an
    estimator designed for unit record data). For a comprehensive review of
    these and other related distributions, see Kleiber and Kotz (2003).  For
    derivation of Lorenz orderings of pairs of income distributions in terms
    of their Singh-Maddala parameters, see Wilfling and Kraemer (1993) and
    Kleiber (1996).  The Singh-Maddala distribution may be useful for
    describing any skewed positive variable.

    The likelihood for a sample of counts nvar is the product, over income
    categories, of the probability F(z2) - F(z1) that the fitted distribution
    assigns to each interval, raised to the power nvar. It is maximized using
    ml model lf. See McDonald (1984) for more information and references.

Example from McDonald (1984)

clear all
input z1 f70 f75 f80
    0    66  35  21
    2.5  125 85  41
    5    152 106 62
    7.5  166 106 65
    10   158 114 73
    12.5 110 109 69
    15   131 188 140
    20   46  116 137
    25   30  95  198
    35   11  32  128
    50   05  14  67
end
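* upper limit = next category's lower limit (missing for the top category)
g z2=z1[_n+1]
* interval midpoints, used for the histograms and the unit-record fits
g mid=(z1+z2)/2
* arbitrary midpoint of 100 for the open-ended top category
replace mid=100 if mid==.
* grid x = 0.5, 1, ..., 100 on which the fitted densities are evaluated
set obs 200
g x=_n/2
la var x "Income"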
dagfit f70 
g dg70=e(ba)*e(bp)*(e(bb)/x)^e(ba)/x*(1+(e(bb)/x)^e(ba))^(-e(bp)-1)
la var dg70 "PDF for grouped Dagum MLE 1970"
tw hist mid [fw=f70]||line dg70 x, name(d)
smgfit f70
g sg70=(e(ba)*e(bq)/e(bb))*((1+(x/e(bb))^e(ba))^-(e(bq)+1))*((x/e(bb))^(e(ba)-1))
la var sg70 "PDF for grouped Singh-Maddala MLE 1970"
tw hist mid [fw=f70]||line sg70 x, name(sm)
dagumfit mid [fw=f70]
g di70=e(ba)*e(bp)*(e(bb)/x)^e(ba)/x*(1+(e(bb)/x)^e(ba))^(-e(bp)-1)
la var di70 "PDF for individual Dagum MLE applied to group data from 1970"
smfit mid [fw=f70]
g si70=(e(ba)*e(bq)/e(bb))*((1+(x/e(bb))^e(ba))^-(e(bq)+1))*((x/e(bb))^(e(ba)-1))
la var si70 "PDF for individual Singh-Maddala MLE applied to group data from 1970"
line dg70 sg70 di70 si70 x, leg(col(1)) scale(.8)

Options

    z1(z1var), z2(z2var) specify the variables holding the lower and upper
        limits of each interval; if they are not specified, variables serving
        that role and called z1 and z2 are assumed to exist. The upper bound
        of each category should always equal the lower bound of the next
        higher category.
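
    As a minimal sketch of that contiguity requirement, the check below can
    be run before fitting. It assumes the limits are stored in variables
    named z1 and z2 and that the open-ended top category has a missing upper
    limit; the four-row dataset is made up purely for illustration.

        clear
        input z1 z2
        0    2.5
        2.5  5
        5    7.5
        7.5  .
        end
        sort z1
        * each upper limit must equal the next category's lower limit
        assert z2 == z1[_n+1] if _n < _N
        * each interval must have positive width (missing exceeds any number)
        assert z2 > z1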

    ?var(var) options, where ? is a, b, or q, specify varlists that are
        assumed to have a linear effect on the parameter specified.  By
        default, each varlist contains only a constant, so only the parameter
        itself is estimated.

    sv?(var) options, where ? is a, b, or q, specify newvars in which to
        store the estimated parameters.  If ?var(var) options have been
        specified, the prediction assumes all explanatory variables are zero,
        i.e. the prediction is only for the constant.

    replace allows sv?(var) options to specify existing variables, whose
        values are replaced in the estimation sample.

    double requests that sv?(var) create doubles.
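
    A hedged usage sketch for the sv?(var), double, and replace options,
    assuming the dataset from the example above (z1, z2, f70, f75) is in
    memory; the variable names a_hat, b_hat, and q_hat are arbitrary.

        smgfit f70, sva(a_hat) svb(b_hat) svq(q_hat) double
        summarize a_hat b_hat q_hat
        * refit another year, overwriting the stored estimates
        smgfit f75, sva(a_hat) svb(b_hat) svq(q_hat) replace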

    from(string) specifies initial values for the Singh-Maddala parameters,
        and is likely to be used only rarely. You can specify the initial
        values in one of three ways: by giving the name of a vector
        containing the initial values (e.g., from(b0) where b0 is a properly
        labeled vector); by specifying coefficient names with the values
        (e.g., from(a:_cons=1 b:_cons=5 q:_cons=0)); or by specifying an
        ordered list of values (e.g., from(1 5 0, copy)).  Poor values in
        from() may lead to convergence problems. For more details, including
        the use of copy and skip, see help maximize.

    level(#) specifies the confidence level, in percent, for the confidence
        intervals of the coefficients; see help level.

    nolog suppresses the iteration log.

    maximize_options control the maximization process. The options available
        are those shown by maximize, with the exception of from().  If you
        are seeing many "(not concave)" messages in the iteration log, using
        the difficult or technique options may help convergence.
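
    For instance, assuming the dataset from the example above is in memory,
    either of the following might be tried when the default settings
    struggle; neither is guaranteed to help.

        smgfit f70, difficult
        smgfit f70, technique(bfgs) nolog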

Saved results

    In addition to the usual results saved after ml, smgfit also saves the
    following:

    e(ba), e(bb), and e(bq) are the estimated Singh-Maddala parameters.

    e(mode), e(mean), e(var), e(sd), e(i2), and e(gini) are the estimated
    mode, mean, variance, standard deviation, half the squared coefficient of
    variation, and Gini coefficient. e(pX) and e(LpX) are the quantiles and
    Lorenz ordinates, where X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75,
    80, 90, 95, 99}.
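
    A short sketch of how these results might be inspected, assuming smgfit
    has just been run on the dataset from the example above. ereturn list
    displays every saved result, so the names used in the display commands
    can be checked against its output.

        smgfit f70
        ereturn list
        display "Gini coefficient = " e(gini)
        display "median = " e(p50) ", mean = " e(mean)
        display "income share of the bottom half = " e(Lp50)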

Formulae

    The Singh-Maddala distribution has distribution function (c.d.f.)

        F(x) = 1-(1+(x/b)^a)^(-q)

    where a, b, and q are parameters, each positive, for random variable
    x > 0. Parameters a and q are the key distributional 'shape' parameters;
    b is a scale parameter.

    The probability density function (p.d.f.) is

        f(x) = (a*q/b)*((1+(x/b)^a)^-(q+1))*((x/b)^(a-1))

    The likelihood function for a sample of observations on nvar is specified
    as the product of the density integrated from z1 to z2 and raised to the
    power nvar, the count of observations in the category, and is maximized
    using ml model lf.
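
    The grouped-data log likelihood can also be evaluated by hand at trial
    parameter values, which may help when choosing starting values. The
    do-file sketch below is not the evaluator used by smgfit. It assumes
    that a missing upper limit marks an open-ended top category, so that
    F(z2) = 1 there; the scalar names and the trial values 2, 10, and 2 are
    arbitrary.

        clear
        input z1 f70
        0    66
        2.5  125
        5    152
        7.5  166
        10   158
        12.5 110
        15   131
        20   46
        25   30
        35   11
        50   5
        end
        gen z2 = z1[_n+1]
        scalar sm_a = 2
        scalar sm_b = 10
        scalar sm_q = 2
        * c.d.f. at the lower and upper limit of each category
        gen double F1 = 1 - (1 + (z1/sm_b)^sm_a)^(-sm_q)
        gen double F2 = cond(missing(z2), 1, ///
            1 - (1 + (z2/sm_b)^sm_a)^(-sm_q))
        * each category contributes count * ln(probability of the interval)
        gen double lnf = f70*ln(F2 - F1)
        quietly summarize lnf
        display "log likelihood at trial values = " r(sum)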

    The formulae used to derive the distributional summary statistics are as
    follows. The r-th moment about the origin is given by

        b^r*B(1+r/a,q-r/a)/B(1,q)

    where B(u,v) is the Beta function = G(u)*G(v)/G(u+v) and
    G(.)=exp(lngamma(.)) is the gamma function, which by substitution and
    using G(1) = 1, implies the moments can be written

        b^r*G(1+r/a)*G(q-r/a)/G(q)

    and hence

        mean = b*G(1+1/a)*G(q-1/a)/G(q)

        variance = b*b*G(1+2/a)*G(q-2/a)/G(q) - (mean^2)

    from which the standard deviation and half the squared coefficient of
    variation can be derived. The mode is

        mode = b*((a-1)/(a*q+1))^(1/a) if a > 1, and 0 otherwise.
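
    As a worked illustration of the moment and mode formulae, the do-file
    sketch below evaluates them for arbitrary parameter values a = 2.5,
    b = 12, and q = 1.8 (not estimates from any dataset). The scalar names
    are illustrative only. The mean requires q > 1/a and the variance
    requires q > 2/a; both hold for these values.

        scalar sm_a = 2.5
        scalar sm_b = 12
        scalar sm_q = 1.8
        * first and second moments about the origin
        scalar sm_m1 = sm_b*exp(lngamma(1+1/sm_a) ///
            + lngamma(sm_q-1/sm_a) - lngamma(sm_q))
        scalar sm_m2 = sm_b^2*exp(lngamma(1+2/sm_a) ///
            + lngamma(sm_q-2/sm_a) - lngamma(sm_q))
        scalar sm_var = sm_m2 - sm_m1^2
        scalar sm_sd = sqrt(sm_var)
        * half the squared coefficient of variation
        scalar sm_i2 = sm_var/(2*sm_m1^2)
        * the mode is positive only when a > 1
        scalar sm_mode = cond(sm_a > 1, ///
            sm_b*((sm_a-1)/(sm_a*sm_q+1))^(1/sm_a), 0)
        display "mean = " sm_m1 "  sd = " sm_sd "  i2 = " sm_i2
        display "mode = " sm_mode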

    The quantiles are derived by inverting the distribution function:

        x_s = b*((1-s)^(-1/q) - 1)^(1/a) for each s = F(x_s).
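
    A small round-trip check of the quantile formula, again with arbitrary
    illustrative values a = 2.5, b = 12, and q = 1.8: the median is computed
    from the inverted c.d.f. and then substituted back into F(x), which
    should return s = 0.5 up to rounding error.

        scalar sm_a = 2.5
        scalar sm_b = 12
        scalar sm_q = 1.8
        scalar sm_s = 0.5
        * quantile at s from the inverted distribution function
        scalar sm_xs = sm_b*((1-sm_s)^(-1/sm_q) - 1)^(1/sm_a)
        * substitute back into the c.d.f.
        scalar sm_Fxs = 1 - (1 + (sm_xs/sm_b)^sm_a)^(-sm_q)
        display "median = " sm_xs "  F(median) = " sm_Fxs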

    The Gini coefficient of inequality is given by

        Gini = 1 - G(q)*G(2q - 1/a) / ( G(q-1/a)*G(2q) ).
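
    The Gini formula can be evaluated on the log scale with lngamma(), which
    avoids overflow in the gamma function for large arguments. The sketch
    below uses the arbitrary illustrative values a = 2.5 and q = 1.8; the
    scale parameter b does not enter, and q > 1/a is required.

        scalar sm_a = 2.5
        scalar sm_q = 1.8
        scalar sm_gini = 1 - exp(lngamma(sm_q) + lngamma(2*sm_q-1/sm_a) ///
            - lngamma(sm_q-1/sm_a) - lngamma(2*sm_q))
        display "Gini = " sm_gini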

    The Lorenz curve ordinates at each s = F(x_s) use the incomplete Beta
    function:

        L(s) = ibeta(1+1/a, q- 1/a, 1-(1-s)^(1/q) ).
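
    Stata's ibeta() function returns the cumulative beta distribution, so
    the Lorenz ordinate can be computed directly from the formula above.
    The sketch below evaluates the income share of the poorest half for
    the arbitrary illustrative values a = 2.5 and q = 1.8.

        scalar sm_a = 2.5
        scalar sm_q = 1.8
        scalar sm_s = 0.5
        scalar sm_L = ibeta(1+1/sm_a, sm_q-1/sm_a, 1-(1-sm_s)^(1/sm_q))
        display "L(" sm_s ") = " sm_L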

        

Author

    Austin Nichols <austinnichols@gmail.com>

Acknowledgements

    Stephen P. Jenkins has made available commands for fitting various
    distributions (including the Singh-Maddala) to individual record data;
    see Jenkins (2004).  This package draws liberally from that work; note in
    particular the similarity of verbiage under headings Formulae and
    Description.


References

    Cronin, D. C. (1979). A Function for the Size Distribution of Income: A
        Further Comment.  Econometrica 47: 773-774.

    Dagum, C. (1977). A new model of personal income distribution:
        specification and estimation. Economie Appliquée 30: 413-437.

    Dagum, C. (1980). The generation and distribution of income, the Lorenz
        curve and the Gini ratio. Economie Appliquée 33: 327-367.

    Jenkins, S.P. (2004). Fitting functional forms to distributions, using
        ml. Presentation at Second German Stata Users Group Meeting, Berlin. 
        http://www.stata.com/meeting/2german/Jenkins.pdf

    Johnson, N. L., and S. Kotz (1970). Continuous Univariate Distributions
        1. New York: John Wiley and Sons.

    Kleiber, C. (1996). Dagum vs. Singh-Maddala income distributions.
        Economics Letters 53: 265-268.

    Kleiber, C. and S. Kotz (2003). Statistical Size Distributions in
        Economics and Actuarial Sciences.  Hoboken, NJ: John Wiley.

    McDonald, J.B. (1984). Some generalized functions for the size
        distribution of income. Econometrica 52: 647-663.

    Singh, S.K. and G.S. Maddala (1976). A function for the size distribution
        of income. Econometrica 44: 963-970.

    Tadikamalla, P. R. (1980). A Look at the Burr and Related Distributions.
        International Statistical Review 48: 337-349.

    Wilfling, B. and W. Kraemer (1993). The Lorenz-ordering of Singh-Maddala
        income distributions. Economics Letters 43: 53-57.

Also see

    Online: help for dagfit, gbgfit, dagumfit, smfit, gb2fit, lognfit, if
             installed, or acquire them from ssc.