```-------------------------------------------------------------------------------
help for gb2fit                                 Stephen P. Jenkins (March 2007)
-------------------------------------------------------------------------------

Fitting a Generalized Beta (Second Kind) distribution by ML to unit record data

gb2fit var [weight] [if exp] [in range] [, avar(varlist1)
bvar(varlist2) pvar(varlist3) qvar(varlist4) abpq(varlist)
stats from(string) poorfrac(#) cdf(cdfname) pdf(pdfname)
robust cluster(varname) svy level(#) maximize_options
svy_options]

by ... : may be used with gb2fit; see help by.

pweights, aweights, fweights, and iweights are allowed; see help weights.
To use pweights, you must first svyset your data and then use the svy
option.

Description

gb2fit fits by ML the 4 parameter Generalized Beta distribution of the
second kind (GB2) to sample observations on a random variable var.  Unit
record data are assumed (rather than grouped data). The GB2 distribution
is also known as the Generalized F distribution (differently
parameterized) or the Feller-Pareto distribution. The Singh-Maddala
(1976) distribution is the special case when parameter p = 1, and the
Dagum (1977, 1980) distribution is the special case when parameter q = 1.
For a comprehensive review of these and other related distributions, see
Kleiber and Kotz (2003). The GB2 distribution has been shown to provide a
good fit to data on income (see e.g. McDonald, 1984) but, of course, it
might also be suitable for describing any skewed variable, not only
income.

The likelihood function for a sample of observations on var is specified
as the product of the densities for each observation (weighted where
relevant), and is maximized using ml model lf.

Options

avar(varlist1), bvar(varlist2), pvar(varlist3), and qvar(varlist4) allow
the user to specify each parameter as a function of the covariates
specified in the respective variable list. A constant term is always
included in each equation.

abpq(varlist) can be used instead of the previous option if the same
covariates are to appear in each parameter equation.

from(string) specifies initial values for the GB2 parameters, and is
likely to be used only rarely. You can specify the initial values in
one of three ways: the name of a vector containing the initial values
(e.g., from(b0) where b0 is a properly labeled vector); by specifying
coefficient names with the values (e.g., from(a:_cons=1 b:_cons=5
p:_cons = 0 q:_cons = .16); or by specifying an ordered list of
values (e.g., from(1 5 0 .16, copy)).  Poor values in from() may lead
to convergence problems. For more details, including the use of copy
and skip, see {help:maximize}.

If covariates are specified, the next four options are not available.
Use gb2pred to generate statistics at particular values of the
covariates, or nlcom. predict can be used to generate the
observation-specific parameters corresponding to the covariate values
of each sample observation: see Examples below.

stats displays selected distributional statistics implied by the GB2
parameter estimates: quantiles, cumulative shares of total var at
quantiles (i.e. the Lorenz curve ordinates), the mode, mean, standard
deviation, variance, and half the coefficient of variation squared.

poorfrac(#) displays the estimated proportion with values of var less
than the cut-off specified by #. This option may be specified when
replaying results.

cdf(cdfname) creates a new variable cdfname containing the estimated GB2
c.d.f. value F(x) for each x.

pdf(pdfname) creates a new variable pdfname containing the estimated GB2
p.d.f. value f(x) for each x.

robust specifies that the Huber/White/sandwich estimator of variance is
to be used in place of the traditional calculation; see [U] 23.14
Obtaining robust variance estimates.  robust combined with cluster()
allows observations which are not independent within cluster
(although they must be independent between clusters).  If you specify
pweights, robust is implied.

cluster(varname) specifies that the observations are independent across
groups (clusters) but not necessarily within groups.  varname
specifies to which group each observation belongs; e.g.,
cluster(personid) in data with repeated observations on individuals.
See [U] 23.14 Obtaining robust variance estimates. cluster() can be
used with pweights to produce estimates for unstratified
cluster-sampled data.  Specifying cluster() implies robust.

svy indicates that ml is to pick up the svy settings set by svyset and
use the robust variance estimator. Thus, this option requires the
data to be svyset; see help svyset. svy may not be combined with
weights or the strata(), psu(), fpc(), or cluster() options.

level(#) specifies the confidence level, in percent, for the confidence
intervals of the coefficients; see help level.

nolog suppresses the iteration log.

maximize_options control the maximization process. The options available
are those shown by maximize, with the exception of from().  If you
are seeing many "(not concave)" messages in the iteration log, using
the difficult or technique options may help convergence.

svy_options specify the options used together with the svy option.

Saved results

In addition to the usual results saved after ml, gb2fit also saves the
following, if no covariates have been specified:

e(ba), e(bb), e(bp), and e(bq) are the estimated GB2 parameters.

e(cdfvar) and e(pdfvar) are the variable names specified for the c.d.f.
and the p.d.f.

e(mean), e(mean), e(var), e(sd), and e(i2) are the estimated mode, mean,
variance, standard deviation, and half coefficient of variation squared.
e(pX), and e(LpX) are the quantiles, and Lorenz ordinates, where X = {1,
5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}.

The following results are saved regardless of whether covariates have
been specified or not.

e(b_a), e(b_b), e(b_p), and e(b_q) are row vectors containing the
parameter estimates from each equation.

e(length_b_a), e(length_b_b), and e(length_b_q) contain the lengths of
these vectors. If no covariates have been specified in an equation, the
corresponding vector has length equal to 1 (the constant term);
otherwise, the length is one plus the number of covariates.

Formulae

The GB2 distribution has distribution function (c.d.f.)

F(x) = ibeta(p, q, (x/b)^a/(1+(x/b)^a) )

where a, b, p, q, are parameters, each positive, for random variable x >
0.  Parameters a, p, and q are the key distributional 'shape' parameters;
b is a scale parameter.

The GB2 distribution has density

f(x) = ax^(ap-1)*{(b^(a*p))*B(p,q)*[1 + (x/b)^a ]^(p+q)}^-1.

The formulae used to derive the distributional summary statistics
presented (optionally) are as follows. The r-th moment about the origin
is given by

b^r*B(p+r/a,q-r/a)/B(p,q)

where B(u,v) is the Beta function = G(u).G(v)/G(u+v) and G(.) is the
gamma function [exp({cmd:{lngamma}(.)]. The moments exist for -ap < r <
aq. By substitution and using G(1) = 1, the moments can be written

b^r*G(p+r/a)*G(q-r/a)/[G(p)G(q)]

and hence

mean = b*G(p+1/a)*G(q-1/a)/[G(p)G(q)]

variance = b*b*G(p+2/a)*G(q-2/a)/[G(p)G(q)] - (mean^2)

from which the standard deviation and half the squared coefficient of
variation can be derived. The mode is

mode = b*((ap-1)/(aq+1))^(1/a) if ap > 1, and 0 otherwise.

The quantiles are derived by inverting the distribution function, and
calculation of the Lorenz ordinates exploits the fact that they follow a
GB2 distribution. (See Kleiber and Kotz, 2003, eqn (6.23).) The Gini
coefficient is not calculated as this requires evaluation of the
generalized hypergeometric function 3F2, and this is not currently
available in Stata.

Examples

. gb2fit x [w=wgt]

. gb2fit

. gb2fit x, a(age sex) b(age sex) p(age sex) q(age sex)

. gb2fit x, abpq(age sex)

. predict double a_i, eq(a) xb

. predict double b_i, eq(b) xb

. predict double p_i, eq(p) xb

. predict double q_i, eq(q) xb

Author

Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and
Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

Acknowledgements

programs for distributional diagnostic plots (qgb2, pgb2).

References

Dagum, C. (1977). A new model of personal income distribution:
specification and estimation. Economie Appliquée 30: 413-437.

Dagum, C. (1980). The generation and distribution of income, the Lorenz
curve and the Gini ratio. Economie Appliquée 33: 327-367.

Jenkins, S.P. (2004). Fitting functional forms to distributions, using
ml. Presentation at Second German Stata Users Group Meeting, Berlin.
http://www.stata.com/meeting/2german/Jenkins.pdf

Kleiber, C. and Kotz, S. (2003).  Statistical Size Distributions in
Economics and Actuarial Sciences.  Hoboken, NJ: John Wiley.

McDonald, J.B. (1984). Some generalized functions for the size
distribution of income. Econometrica 52: 647-663.

Singh, S.K. and G.S. Maddala (1976). A function for the size distribution
of income. Econometrica 44: 963-970.

Also see

Online: help for gb2pred, qgb2, pgb2, smfit, dagumfit, lognfit, if
installed.

```