{smcl}
{* October 2007}{...}
{hline}
help for {hi:gb2lfit}{right:Stephen P. Jenkins (October 2007)}
{hline}
{title:Fitting a Generalized Beta (Second Kind) distribution by ML to unit record data}
{p 8 17 2}{cmd:gb2lfit} {it:var} [{it:weight}] [{cmd:if} {it:exp}]
[{cmd:in} {it:range}] [{cmd:,} {cmdab:cens:var(}{it:varname}{cmd:)}
{cmdab:left:tr(}{it:varname}{cmd:)}
{cmdab:a:var(}{it:varlist1}{cmd:)} {cmdab:b:var(}{it:varlist2}{cmd:)}
{cmdab:p:var(}{it:varlist3}{cmd:)} {cmdab:q:var(}{it:varlist4}{cmd:)}
{cmdab:abpq(}{it:varlist}{cmd:)} {cmdab:st:ats}
{cmdab:ex:tras} {cmdab:igini} {cmdab:nips(}#{cmd:)}
{cmdab:cgini} {cmdab:eps:ilon(}#{cmd:)}
{cmdab:f:rom(}{it:string}{cmd:)} {cmdab:poor:frac(}{it:#}{cmd:)}
{cmdab:cdf(}{it:cdfname}{cmd:)} {cmdab:pdf(}{it:pdfname}{cmd:)}
{cmdab:r:obust} {cmdab:cl:uster(}{it:varname}{cmd:)} {cmdab:svy:}
{cmdab:l:evel(}{it:#}{cmd:)} {it:maximize_options} {it:svy_options}]
{p 4 4 2}{cmd:by} {it:...} {cmd::} may be used with {cmd:gb2lfit}; see help
{help by}.
{p 4 4 2}{cmd:pweight}s, {cmd:aweight}s, {cmd:fweight}s, and {cmd:iweight}s
are allowed; see help {help weights}. To use {cmd:pweight}s, you must first
{cmd:svyset} your data and then use the {cmd:svy} option.
{title:Description}
{p 4 4 2}
{cmd:gb2lfit} fits by ML the 4 parameter Generalized Beta distribution
of the Second Kind (GB2) to sample observations on a random variable {it:var}.
Unit record data are assumed (rather than grouped data). The GB2 distribution is
also known as the Generalized F distribution (in a different parameterization) or
the Feller-Pareto distribution. The Singh-Maddala (1976) distribution is the
special case when parameter p = 1, and the Dagum (1977, 1980) distribution
is the special case when parameter q = 1. For a comprehensive review
of these and other related distributions, see Kleiber and Kotz (2003).
The GB2 distribution has been shown to provide a good fit to data on income (see
e.g. McDonald, 1984) but, of course, it might also be suitable for describing any
skewed variable, not only income. Compared to {cmd:gb2fit}, {cmd:gb2lfit} has the
following features: (i) parameters are estimated in the logarithmic metric (but
reported in the original metric); (ii) data may be right-censored or left-truncated;
(iii) there are improved options for calculating inequality measures implied by
the parameter estimates (with SEs); (iv) in particular, estimates of the Gini coefficient
are reported.
{p 4 4 2}
The likelihood contribution for an observation with a (non-censored) value
of {it:var} is given by the GB2 density, and for an observation with a right-censored
(`top coded') value of {it:var} is given by 1 minus the GB2 cdf. See Feng {it:et al.}
(2006, p. 60). If data are left-truncated, likelihood contributions are appropriately
conditioned on survival up to the truncation point. The sample likelihood is maximized
using {cmd:ml model lf}. The GB2 parameters are estimated in a logarithmic metric to
ensure that each is positive; results are displayed in the logarithmic and natural metric.
{title:Options}
{p 4 8 2}{cmd:censvar(}{it:varname}{cmd:)} is used to identify observations with
right-censored (`top coded') values of {it:var}. Existing variable
{it:varname} must equal 0 for non-censored observations, and 1 for
right-censored observations.
{p 4 8 2}{cmd:lefttr(}{it:varname}{cmd:)} is used when estimation is based on
left-truncated data: data are only available if values of {it:var} are greater than
the value given by the variable labelled {it:varname} in this option. For example,
income can range from zero to some maximum value in principle but a particular survey
might by design only contain observations with values greater than a particular value.
In survival analysis applications, left truncation is also known as 'delayed entry'.
{p 4 8 2}{cmd:avar(}{it:varlist1}{cmd:)}, {cmd:bvar(}{it:varlist2}{cmd:)},
{cmd:pvar(}{it:varlist3}{cmd:)}, and {cmd:qvar(}{it:varlist4}{cmd:)}
allow the user to specify (the log of) each parameter as a function of
the covariates specified in the respective variable list. A constant
term is always included in each equation.
{p 4 8 2}{cmd:abpq(}{it:varlist}{cmd:)} can be used instead of the previous
option if the same covariates are to appear in each parameter equation.
{p 4 8 2}{cmd:from(}{it:string}{cmd:)} specifies initial values for the (log of the) GB2
parameters, and is likely to be used only rarely. You can specify the initial
values in one of three ways: the name of a vector containing the initial values
(e.g., from(b0) where b0 is a properly labeled vector); by specifying coefficient
names with the values (e.g., from(ln_a:_cons=1 ln_b:_cons=5 ln_p:_cons = 0 ln_q:_cons = .16);
or by specifying an ordered list of values (e.g., from(1 5 0 .16, copy)).
Poor values in from() may lead to convergence problems. For more details,
including the use of copy and skip, see {help maximize}.
{p 8 8 2}If covariates are specified, the next ten options are not available.
If covariates are specified, use {help gb2pred} or {cmd:nlcom} to generate
at particular covariate values statistics corresponding to those produced
by the {cmd:stats} option below. {cmd:predictnl} can be used to generate the
observation-specific parameters corresponding to the covariate values of
each sample observation
{p 4 8 2}{cmd:stats} displays selected distributional statistics implied by the
GB2 parameter estimates: selected quantiles, cumulative shares of total {it:var}
at those quantiles (i.e. the Lorenz curve ordinates), the mode, mean, standard deviation,
variance, and half the coefficient of variation squared.
{p 4 8 2}{cmd:extras} displays selected statistics implied by the GB2 parameter estimates,
together with estimates of their standard errors, including: selected quantiles; Lorenz curve
ordinates at those quantiles; the ratio of the 75th to the 25th percentile, and of the 90th to the 10th
percentile; the shares of the richest 1%, 5% and 10%; the mean, standard deviation, variance;
and generalized entropy indices GE(a) for a = -1, 0, 1, 2. GE(0) is the mean
logarithmic deviation, GE(1) is the Theil index, and GE(2) is half the coefficient
of variation squared. Standard errors are derived using the delta method.
{p 4 8 2}{cmd:cgini} displays the Gini coefficient implied by the GB2 parameter estimates,
together with an estimate of its standard error derived using the delta method.
Computation uses the McDonald (1984) formula based on 3F2 generalized hypergeometric functions.
Calculation of the Gini using this option takes substantially (sometimes prohibitively)
longer than calculation using the {cmd:igini} option.
{p 4 8 2}{cmd:epsilon(}#{cmd:)} allows users to change the convergence criterion used when
calculating the Gini coefficient using the {cmd:cgini} option. In the GB2 case,
the Gini is a function of two 3F2 generalized hypergeometric functions, each of which
is an infinite series. Convergence of the series is slower, the smaller that a*q is
(the series is non-convergent if a*q <= 1). By default, calculation stops when the largest
change between iterations in the calculations of the two sums used to evaluate the 3F2 functions
is less than 1e-10. Users are advised to vary the choice of epsilon to check the
accuracy of the estimate of the Gini coefficient.
{p 4 8 2}{cmd:igini} displays the Gini coefficient implied by the GB2 parameter estimates,
together with an estimate of its standard error derived using the delta method.
Computation uses numerical integration over the range [0,1], i.e. over the range of the
cumulative distribution function for {it:var}.
{p 4 8 2}{cmd:nips(}#{cmd:)} sets the number of integration points used when the {cmd:igini} option
is used. The default value is 5000. Increasing the number of integration points, other things equal,
will increase accuracy of the calculation, but at a decreasing rate. Users are advised to vary the
number of integration points in order to check that the estimate of the Gini coefficient
is sufficiently accurate.
{p 4 8 2}{cmd:poorfrac(}{it:#}{cmd:)} displays the estimated proportion with values of {it:var}
less than the cut-off specified by {it:#}. This option may be specified when replaying
results.
{p 4 8 2}{cmd:cdf(}{it:cdfname}{cmd:)} creates a new variable {it:cdfname} containing the
estimated GB2 c.d.f. value F(x) for each x.
{p 4 8 2}{cmd:pdf(}{it:pdfname}{cmd:)} creates a new variable {it:pdfname} containing the
estimated GB2 p.d.f. value f(x) for each x.
{p 4 8 2}{cmd:robust} specifies that the Huber/White/sandwich estimator of
variance is to be used in place of the traditional calculation; see
{hi:[U] 23.14 Obtaining robust variance estimates}. {cmd:robust} combined
with {cmd:cluster()} allows observations which are not independent within
cluster (although they must be independent between clusters). If you
specify {help pweight}s, {cmd:robust} is implied.
{p 4 8 2}{cmd:cluster(}{it:varname}{cmd:)} specifies that the observations are
independent across groups (clusters) but not necessarily within groups.
{it:varname} specifies to which group each observation belongs; e.g.,
{cmd:cluster(personid)} in data with repeated observations on individuals.
See {hi:[U] 23.14 Obtaining robust variance estimates}. {cmd:cluster()} can be
used with {help pweight}s to produce estimates for unstratified
cluster-sampled data. Specifying {cmd:cluster()} implies {cmd:robust}.
{p 4 8 2}{cmd:svy} indicates that {cmd:ml} is to pick up the {cmd:svy} settings
set by {cmd:svyset} and use the robust variance estimator. Thus, this option
requires the data to be {cmd:svyset}; see help {help svyset}. {cmd:svy} may not be
combined with weights or the {cmd:strata()}, {cmd:psu()}, {cmd:fpc()}, or
{cmd:cluster()} options.
{p 4 8 2}{cmd:level(}{it:#}{cmd:)} specifies the confidence level, in percent,
for the confidence intervals of the coefficients; see help {help level}.
{p 4 8 2}{cmd:nolog} suppresses the iteration log.
{p 4 8 2}{it:maximize_options} control the maximization process. The options
available are those shown by {help maximize}, with the exception of {cmd:from()}.
If you are seeing many "(not concave)" messages in the log, using the
{cmd:difficult} or {cmd:technique} options may help convergence.
{p 4 8 2}{it:svy_options} specify the options used together with the {cmd:svy} option.
{title:Saved results}
{p 4 4 2}In addition to the usual results saved after {cmd:ml}, {cmd:gb2lfit} also
saves the following, if no covariates have been specified:
{p 4 4 2}{cmd:e(ba)}, {cmd:e(bb)}, {cmd:e(bp)}, and {cmd:e(bq)} are the estimated GB2
parameters. The corresponding estimated standard errors are saved in {cmd:e(se_ba)},
{cmd:e(se_bb)}, {cmd:e(se_bp)}, and {cmd:e(se_bq)}.
{p 4 4 2}{cmd:e(cdfvar)} and {cmd:e(pdfvar)} are the variable names specified for the
c.d.f. and the p.d.f.
{p 4 4 2}After use of the {cmd:stats} option: {cmd:e(mean)}, {cmd:e(mean)}, {cmd:e(var)},
{cmd:e(sd)}, and {cmd:e(i2)} are the estimated mode, mean, variance, standard deviation,
and half coefficient of variation squared. {cmd:e(pX)} and {cmd:e(LpX)} are the quantiles
and Lorenz ordinates, where X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}.
{p 4 4 2}After use of the {cmd:extras} option: {cmd:e(aXp)}, {cmd:e(aXq)} are products of
parameters a*p and a*q; {cmd:e(cmean)}, {cmd:e(cvar)}, and {cmd:e(csd)} are the estimated
mean, variance, and standard deviation; {cmd:e(cshtop01)}, {cmd:e(cshtop05)}, and
{cmd:e(cshtop10)}, are the shares of the richest 1%, 5%, and 10%; {cmd:e(cpX)} and
{cmd:e(LpX)} are the quantiles and Lorenz ordinates, where
X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}; {cmd:e(cp75p25)} and
{cmd:e(cp90p10)} are the ratios of the 75th percentile to the 25th percentile, and
of the 90th percentile to the 10th percentile; {cmd:e(GEm1)}, {cmd:e(GE0)}, {cmd:e(GE1)},
and {cmd:e(GE2)} are generalized entropy inequality indices GE(a) for a = -1, 0, 1, and 2,
respectively. Standard errors for the statistics are also saved, in macros with
"se_" prefixes, e.g. {cmd:e(se_cp90p10)} contains the estimated standard error of the
p90:p10 ratio.
{p 4 4 2}After use of the {cmd:cgini} option: {cmd:e(cgini)} is the estimate of the Gini, and
{cmd:e(se_cgini)} is the corresponding standard error. The number of iterations used to
evaluate the formulae for the Gini is saved in {cmd:e(num_iter)}.
{p 4 4 2}After use of the {cmd:igini} option: {cmd:e(igini)} is the estimate of the Gini, and
{cmd:e(se_igini)} is the corresponding standard error. The value of the upper integration point
is saved in {cmd:e(uip)} and the step size is saved in {cmd:e(step)}.
{p 4 4 2}The following results are saved regardless of whether covariates have been
specified or not.
{p 4 4 2}{cmd:e(b_lna)}, {cmd:e(b_lnb)}, {cmd:e(b_lnp)}, and {cmd:e(b_lnq)}
are row vectors containing the parameter estimates from each equation.
{p 4 4 2}{cmd:e(length_b_lna)}, {cmd:e(length_b_lnb)}, {cmd:e(length_b_lnp)}, and
{cmd:e(length_b_lnq)} contain the lengths of these vectors. If no covariates
have been specified in an equation, the corresponding vector has length
equal to 1 (the constant term); otherwise, the length is one plus the
number of covariates.
{title:Formulae}
{p 4 4 2}
The GB2 distribution has distribution function (c.d.f.)
{p 8 8 2}
F(x) = {cmd:ibeta}(p, q, (x/b)^a/(1+(x/b)^a) )
{p 4 4 2}
where a, b, p, q, are parameters, each positive, for random variable x > 0.
Parameters a, p, and q are the key distributional 'shape' parameters; b is a scale parameter.
{p 4 4 2}
The GB2 distribution has density
{p 8 8 2}
f(x) = ax^(ap-1)*{(b^(a*p))*B(p,q)*[1 + (x/b)^a ]^(p+q)}^-1.
{p 4 4 2}
The formulae used to derive the distributional summary statistics
presented (optionally) are as follows. The r-th moment about the origin
is given by
{p 8 8 2}
b^r*B(p+r/a,q-r/a)/B(p,q)
{p 4 4 2}
where B(u,v) is the Beta function = G(u)*G(v)/G(u+v) and G(.) is the
gamma function [exp({cmd:lngamma}(.)]. The moments exist for -ap < r < aq.
By substitution and using G(1) = 1, the moments can be written
{p 8 8 2}
b^r*G(p+r/a)*G(q-r/a)/[G(p)G(q)]
{p 4 4 2}
and hence
{p 8 8 2}
mean = b*G(p+1/a)*G(q-1/a)/[G(p)G(q)]
{p 8 8 2}
variance = b*b*G(p+2/a)*G(q-2/a)/[G(p)G(q)] - (mean^2)
{p 4 4 2}
from which the standard deviation and half the squared coefficient of
variation can be derived. The mode is
{p 4 4 2}
mode = b*((ap-1)/(aq+1))^(1/a) if ap > 1, and 0 otherwise.
{p 4 4 2}
The quantiles are derived by inverting the distribution function, and
calculation of the Lorenz ordinates exploits the fact that they follow
a GB2 distribution. (See Kleiber and Kotz, 2003, eqn (6.23).) The Gini coefficient
is a function of 3F2 generalized hypergeometric functions: see McDonald (1984).
Expressions for the generalized entropy indices of inequality are derived for the
GB2 case by Jenkins (2009).
{title:Examples}
{p 4 8 2}{inp:. gb2lfit x [w=wgt], stats }
{p 4 8 2}{inp:. gb2lfit }
{p 4 8 2}{inp:. matrix b1 = (3, 200, 1.2, 1.4) }
{p 4 8 2}{inp:. matrix colnames b1 = ln_a:_cons ln_b:_cons ln_p:_cons ln_q:_cons }
{p 4 8 2}{inp:. gb2lfit x [w=wgt], from(b1) }
{p 4 8 2}{inp:. gb2lfit x [w=wgt], from(b1) extras }
{p 4 8 2}{inp:. gb2lfit x [w=wgt], stats extras igini}
{p 4 8 2}{inp:. gb2lfit x [w=wgt], cgini epsilon(1e-8) }
{title:Author}
{p 4 4 2}Stephen P. Jenkins
{title:Acknowledgements}
{p 4 4 2}N.J. Cox made numerous helpful comments and suggestions, and also wrote
programs for distributional diagnostic plots ({help qgb2}, {help pgb2}).
{title:References}
{p 4 8 2}Dagum, C. 1977. A new model of personal income distribution:
specification and estimation. {it:Economie Appliqu{c e'}e} 30: 413{c -}437.
{p 4 8 2}Dagum, C. 1980. The generation and distribution of income, the
Lorenz curve and the Gini ratio. {it:Economie Appliqu{c e'}e} 33: 327{c -}367.
{p 4 8 2}Feng, S., Burkhauser, R.V. and Butler, J.S. 2006. Levels and long-term
trends in earnings inequality: overcoming Current Population Survey
censoring problems using the GB2 distribution.
{it:Journal of Business & Economic Statistics} 24: 51{c -}62.
{p 4 8 2}Jenkins, S.P. 2009. Distributionally-sensitive inequality indices and the GB2 income distribution.
{it:Review of Income and Wealth} 55: 392{c -}398.
{p 4 8 2}Kleiber, C. and Kotz, S. 2003.
{it:Statistical Size Distributions in Economics and Actuarial Sciences}.
Hoboken, NJ: John Wiley.
{p 4 8 2}McDonald, J.B. 1984. Some generalized functions for the size
distribution of income. {it:Econometrica} 52: 647{c -}663.
{p 4 8 2}Singh, S.K. and G.S. Maddala 1976. A function for the size
distribution of income. {it:Econometrica} 44: 963{c -}970.
{title:Also see}
{p 4 13 2}
Online: help for {help gb2fit} {help gb2pred}, {help qgb2}, {help pgb2}, {help smfit},
{help dagumfit}, {help lognfit}, if installed.