{smcl}
{* March 2007}{...}
{hline}
help for {hi:gb2fit}{right:Stephen P. Jenkins (March 2007)}
{hline}

{title:Fitting a Generalized Beta (Second Kind) distribution by ML to unit record data}

{p 8 17 2}{cmd:gb2fit} {it:var} [{it:weight}] [{cmd:if} {it:exp}]
	[{cmd:in} {it:range}] [{cmd:,}
	{cmdab:a:var(}{it:varlist1}{cmd:)} {cmdab:b:var(}{it:varlist2}{cmd:)} 
	{cmdab:p:var(}{it:varlist3}{cmd:)} {cmdab:q:var(}{it:varlist4}{cmd:)}
	{cmdab:abpq(}{it:varlist}{cmd:)} {cmdab:st:ats}  
	{cmdab:f:rom(}{it:string}{cmd:)} {cmdab:poor:frac(}{it:#}{cmd:)} 
	{cmdab:cdf(}{it:cdfname}{cmd:)} {cmdab:pdf(}{it:pdfname}{cmd:)}
	{cmdab:r:obust} {cmdab:cl:uster(}{it:varname}{cmd:)} {cmdab:svy:} 
	{cmdab:l:evel(}{it:#}{cmd:)} {it:maximize_options} {it:svy_options}]

{p 4 4 2}{cmd:by} {it:...} {cmd::} may be used with {cmd:gb2fit}; see help
{help by}. 

{p 4 4 2}{cmd:pweight}s, {cmd:aweight}s, {cmd:fweight}s, and {cmd:iweight}s 
are allowed; see help {help weights}. To use {cmd:pweight}s, you must first 
{cmd:svyset} your data and then use the {cmd:svy} option.

{title:Description}

{p 4 4 2}
{cmd:gb2fit} fits by ML the four-parameter Generalized Beta distribution 
of the second kind (GB2) to sample observations on a random variable {it:var}. 
Unit record data are assumed (rather than grouped data). The GB2 distribution
is also known as the Generalized F distribution (differently parameterized) or
the Feller-Pareto distribution. The Singh-Maddala (1976) distribution is the 
special case when parameter p = 1, and the Dagum (1977, 1980) distribution is 
the special case when parameter q = 1. For a comprehensive review of these 
and other related distributions, see Kleiber and Kotz (2003). The GB2 distribution 
has been shown to provide a good fit to data on income (see e.g. McDonald, 1984) 
but, of course, it might also be suitable for describing any skewed variable, 
not only income.

{p 4 4 2}
The likelihood function for a sample of observations on {it:var} is specified 
as the product of the densities for each observation (weighted where relevant), and
is maximized using {cmd:ml model lf}. 


{title:Options}

{p 4 8 2}{cmd:avar(}{it:varlist1}{cmd:)}, {cmd:bvar(}{it:varlist2}{cmd:)}, 
	{cmd:pvar(}{it:varlist3}{cmd:)}, and {cmd:qvar(}{it:varlist4}{cmd:)} 
	allow the user to specify each parameter as a function of the covariates 
	specified in the respective variable list. A constant term is always 
	included in each equation.

{p 4 8 2}{cmd:abpq(}{it:varlist}{cmd:)} can be used instead of the previous four options 
	if the same covariates are to appear in each parameter equation.

{p 4 8 2}{cmd:from(}{it:string}{cmd:)} specifies initial values for the GB2
	parameters, and is likely to be used only rarely. You can specify the initial 
	values in one of three ways: by giving the name of a vector containing the initial values 
	(e.g., from(b0) where b0 is a properly labeled vector); by specifying coefficient 
	names with the values (e.g., from(a:_cons=1 b:_cons=5 p:_cons = 0 q:_cons = .16));
	or by specifying an ordered list of values (e.g., from(1 5 0 .16, copy)).  
	Poor values in from() may lead to convergence problems. For more details, 
	including the use of copy and skip, see {help maximize}.
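
{p 8 8 2}As a minimal sketch of the vector form, the vector can be built and labeled 
	before fitting. The name b0 and the starting values are illustrative assumptions 
	only; the equation names a, b, p, and q follow the coefficient names used in 
	the description of this option.

{p 8 12 2}{inp:. matrix b0 = (1, 5, 1, 1) }

{p 8 12 2}{inp:. matrix colnames b0 = a:_cons b:_cons p:_cons q:_cons }

{p 8 12 2}{inp:. gb2fit x, from(b0) }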

{p 8 8 2}If covariates are specified, the next four options ({cmd:stats}, 
	{cmd:poorfrac()}, {cmd:cdf()}, and {cmd:pdf()}) are not available. 
	Use {help gb2pred} or {cmd:nlcom} to generate statistics at particular values 
	of the covariates. {cmd:predict} can be used to generate the 
	observation-specific parameters corresponding to the covariate values of
	each sample observation: see Examples below.

{p 4 8 2}{cmd:stats} displays selected distributional statistics implied by the
	GB2 parameter estimates: quantiles, cumulative shares of total {it:var} at quantiles 
	(i.e. the Lorenz curve ordinates), the mode, mean, standard deviation, variance, 
	and half the coefficient of variation squared. 

{p 4 8 2}{cmd:poorfrac(}{it:#}{cmd:)} displays the estimated proportion with values of {it:var} 
	less than the cut-off specified by {it:#}. This option may be specified when replaying
	results.

{p 4 8 2}{cmd:cdf(}{it:cdfname}{cmd:)} creates a new variable {it:cdfname} containing the
	estimated GB2 c.d.f. value F(x) for each x.

{p 4 8 2}{cmd:pdf(}{it:pdfname}{cmd:)} creates a new variable {it:pdfname} containing the
	estimated GB2 p.d.f. value f(x) for each x.
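
{p 8 8 2}For instance, assuming a variable x and no covariates, these four options 
	might be combined as follows. The cut-offs 100 and 50 and the new variable names 
	Fx and fx are hypothetical. The second command replays the results with a 
	different cut-off, and the third simply inspects the variables created.

{p 8 12 2}{inp:. gb2fit x, stats poorfrac(100) cdf(Fx) pdf(fx) }

{p 8 12 2}{inp:. gb2fit, poorfrac(50) }

{p 8 12 2}{inp:. summarize Fx fx }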

{p 4 8 2}{cmd:robust} specifies that the Huber/White/sandwich estimator of
variance is to be used in place of the traditional calculation; see
{hi:[U] 23.14 Obtaining robust variance estimates}.  {cmd:robust} combined
with {cmd:cluster()} allows for observations that are not independent within
clusters (although they must be independent between clusters).  If you 
specify {help pweight}s, {cmd:robust} is implied.

{p 4 8 2}{cmd:cluster(}{it:varname}{cmd:)} specifies that the observations are
independent across groups (clusters) but not necessarily within groups.
{it:varname} specifies to which group each observation belongs; e.g.,
{cmd:cluster(personid)} in data with repeated observations on individuals. 
See {hi:[U] 23.14 Obtaining robust variance estimates}. {cmd:cluster()} can be
used with {help pweight}s to produce estimates for unstratified
cluster-sampled data.  Specifying {cmd:cluster()} implies {cmd:robust}.

{p 4 8 2}{cmd:svy} indicates that {cmd:ml} is to pick up the {cmd:svy} settings 
set by {cmd:svyset} and use the robust variance estimator. Thus, this option 
requires the data to be {cmd:svyset}; see help {help svyset}. {cmd:svy} may not be 
combined with weights or the {cmd:strata()}, {cmd:psu()}, {cmd:fpc()}, or 
{cmd:cluster()} options.

{p 4 8 2}{cmd:level(}{it:#}{cmd:)} specifies the confidence level, in percent,
for the confidence intervals of the coefficients; see help {help level}.

{p 4 8 2}{cmd:nolog} suppresses the iteration log.

{p 4 8 2}{it:maximize_options} control the maximization process. The options
available are those shown by {help maximize}, with the exception of {cmd:from()}.
If you are seeing many "(not concave)" messages in the iteration
log, using the {cmd:difficult} or {cmd:technique} options may help convergence.
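
{p 8 8 2}For example, either of the following might be tried when the default 
	maximization struggles. Both are standard {cmd:ml} maximize options; the choice 
	of the bfgs technique is only an illustration, and whether either helps 
	depends on the data.

{p 8 12 2}{inp:. gb2fit x, difficult }

{p 8 12 2}{inp:. gb2fit x, technique(bfgs) }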

{p 4 8 2}{it:svy_options} specify the options used together with the {cmd:svy} option.


{title:Saved results}

{p 4 4 2}In addition to the usual results saved after {cmd:ml}, {cmd:gb2fit} also
saves the following, if no covariates have been specified:

{p 4 4 2}{cmd:e(ba)}, {cmd:e(bb)}, {cmd:e(bp)}, and {cmd:e(bq)} are the estimated GB2
parameters.

{p 4 4 2}{cmd:e(cdfvar)} and {cmd:e(pdfvar)} are the variable names specified for the
c.d.f. and the p.d.f.

{p 4 4 2}{cmd:e(mode)}, {cmd:e(mean)}, {cmd:e(var)}, {cmd:e(sd)}, and {cmd:e(i2)}
are the estimated mode, mean, variance, standard deviation, and half coefficient of 
variation squared.  {cmd:e(pX)} and {cmd:e(LpX)} are the quantiles and Lorenz 
ordinates, where X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}. 


{p 4 4 2}The following results are saved regardless of whether covariates have been 
specified or not.

{p 4 4 2}{cmd:e(b_a)}, {cmd:e(b_b)}, {cmd:e(b_p)}, and {cmd:e(b_q)} 
are row vectors containing the parameter estimates from each equation. 

{p 4 4 2}{cmd:e(length_b_a)}, {cmd:e(length_b_b)}, {cmd:e(length_b_p)}, and {cmd:e(length_b_q)} contain
the lengths of these vectors. If no covariates have been specified in an equation, 
the corresponding vector has length equal to 1 (the constant term); 
otherwise, the length is one plus the number of covariates.
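
{p 4 4 2}A short sketch of how these results might be inspected after fitting a 
model without covariates (the variable name x is an assumption):

{p 8 12 2}{inp:. gb2fit x }

{p 8 12 2}{inp:. display e(ba) "  " e(bb) "  " e(bp) "  " e(bq) }

{p 8 12 2}{inp:. matrix list e(b_a) }

{p 8 12 2}{inp:. display e(length_b_a) }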


{title:Formulae}

{p 4 4 2}
The GB2 distribution has distribution function (c.d.f.)

{p 8 8 2}
	F(x) = {cmd:ibeta}(p, q, (x/b)^a/(1+(x/b)^a) )

{p 4 4 2}
where a, b, p, and q are parameters, each positive, for random variable x > 0. 
Parameters a, p, and q are the key distributional 'shape' parameters; b is a scale parameter.

{p 4 4 2}
The GB2 distribution has density

{p 8 8 2}		         
	f(x) = ax^(ap-1)*{(b^(a*p))*B(p,q)*[1 + (x/b)^a ]^(p+q)}^-1.
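
{p 4 4 2}
As a rough check of these two formulae, they can be evaluated by hand from the 
saved parameter estimates after fitting a model without covariates. This is a 
sketch, not necessarily how {cmd:gb2fit} computes them internally; the variable 
names z, Fx_chk, and fx_chk are hypothetical, and the density is evaluated on the 
log scale to limit overflow.

{p 8 12 2}{inp:. generate double z = (x/e(bb))^e(ba) / (1 + (x/e(bb))^e(ba)) }

{p 8 12 2}{inp:. generate double Fx_chk = ibeta(e(bp), e(bq), z) }

{p 8 12 2}{inp:. generate double fx_chk = exp(ln(e(ba)) + (e(ba)*e(bp) - 1)*ln(x) - e(ba)*e(bp)*ln(e(bb)) - lngamma(e(bp)) - lngamma(e(bq)) + lngamma(e(bp) + e(bq)) - (e(bp) + e(bq))*ln(1 + (x/e(bb))^e(ba))) }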

{p 4 4 2}
The formulae used to derive the distributional summary statistics 
presented (optionally) are as follows. The r-th moment about the origin
is given by

{p 8 8 2}
	b^r*B(p+r/a,q-r/a)/B(p,q)

{p 4 4 2}
where B(u,v) is the Beta function = G(u).G(v)/G(u+v) and G(.) is the 
gamma function [exp({cmd:lngamma}(.))]. The moments exist for 
-ap < r < aq. By substitution, the G(p+q) terms cancel and the moments 
can be written

{p 8 8 2}
	b^r*G(p+r/a)*G(q-r/a)/[G(p)G(q)]

{p 4 4 2}
and hence

{p 8 8 2} 
      	mean = b*G(p+1/a)*G(q-1/a)/[G(p)G(q)]

{p 8 8 2}
      	variance = b*b*G(p+2/a)*G(q-2/a)/[G(p)G(q)] - (mean^2) 

{p 4 4 2}
from which the standard deviation and half the squared coefficient of 
variation can be derived. The mode is

{p 8 8 2}
	mode = b*((ap-1)/(aq+1))^(1/a) if ap > 1, and 0 otherwise.
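
{p 4 4 2}
When no covariates are specified, the mean formula can also be passed to 
{cmd:nlcom} to obtain a delta-method standard error. The following is a sketch 
under the assumption that the equations are named a, b, p, and q (as in the 
{cmd:from()} and {cmd:predict} examples in this help file) and that q > 1/a, so 
that the mean exists. The point estimate can be compared with {cmd:e(mean)}.

{p 8 12 2}{inp:. nlcom [b]_cons*exp(lngamma([p]_cons + 1/[a]_cons) + lngamma([q]_cons - 1/[a]_cons) - lngamma([p]_cons) - lngamma([q]_cons)) }

{p 8 12 2}{inp:. display e(mean) }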

{p 4 4 2} 
The quantiles are derived by inverting the distribution function, and
calculation of the Lorenz ordinates exploits the fact that the first-moment 
distribution of a GB2 variate is itself GB2, with parameters (a, b, p + 1/a, q - 1/a). 
(See Kleiber and Kotz, 2003, eqn (6.23).) The Gini coefficient 
is not calculated as this requires evaluation of the generalized hypergeometric 
function 3{it:F}2, and this is not currently available in Stata. 
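
{p 4 4 2}
To illustrate the inversion, the median and the Lorenz ordinate at the median can 
be sketched from the saved estimates of a model without covariates. This assumes 
that the inverse incomplete beta function {cmd:invibeta()} is available in your 
Stata release and that q > 1/a; the scalar name z50 is hypothetical, and the code 
is not necessarily what {cmd:gb2fit} uses internally. The first {cmd:display} gives 
the median, x = b*(z/(1-z))^(1/a), and the second gives the Lorenz ordinate.

{p 8 12 2}{inp:. scalar z50 = invibeta(e(bp), e(bq), 0.5) }

{p 8 12 2}{inp:. display e(bb)*(z50/(1 - z50))^(1/e(ba)) }

{p 8 12 2}{inp:. display ibeta(e(bp) + 1/e(ba), e(bq) - 1/e(ba), z50) }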

	
{title:Examples}

{p 4 8 2}{inp:. gb2fit x [w=wgt] }

{p 4 8 2}{inp:. gb2fit }

{p 4 8 2}{inp:. gb2fit x, a(age sex) b(age sex) p(age sex) q(age sex) }

{p 4 8 2}{inp:. gb2fit x, abpq(age sex) }

{p 4 8 2}{inp:. predict double a_i,  eq(a) xb }

{p 4 8 2}{inp:. predict double b_i,  eq(b) xb }

{p 4 8 2}{inp:. predict double p_i,  eq(p) xb }

{p 4 8 2}{inp:. predict double q_i,  eq(q) xb }

{p 4 4 2}See also the examples in the presentation by 
{browse "http://www.stata.com/meeting/2german/Jenkins.pdf":Jenkins (2004)}.

{title:Author}

{p 4 4 2}Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social
and Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

{title:Acknowledgements}

{p 4 4 2}N.J. Cox made numerous helpful comments and suggestions, and also wrote
programs for distributional diagnostic plots ({help qgb2}, {help pgb2}).


{title:References}

{p 4 8 2}Dagum, C. (1977). A new model of personal income distribution:
	specification and estimation. {it:Economie Appliqu{c e'}e} 30: 413-437.

{p 4 8 2}Dagum, C. (1980). The generation and distribution of income, the
	Lorenz curve and the Gini ratio. {it:Economie Appliqu{c e'}e} 33: 327-367.

{p 4 8 2}Jenkins, S.P. (2004). Fitting functional forms to distributions, using {cmd:ml}. Presentation
	at Second German Stata Users Group Meeting, Berlin. {browse "http://www.stata.com/meeting/2german/Jenkins.pdf"}

{p 4 8 2}Kleiber, C. and Kotz, S. (2003). 
	{it:Statistical Size Distributions in Economics and Actuarial Sciences}. 
	Hoboken, NJ: John Wiley.

{p 4 8 2}McDonald, J.B. (1984). Some generalized functions for the size
	distribution of income. {it:Econometrica} 52: 647-663.

{p 4 8 2}Singh, S.K. and G.S. Maddala (1976). A function for the size
	distribution of income. {it:Econometrica} 44: 963-970.


{title:Also see}

{p 4 13 2}
Online: help for {help gb2pred}, {help qgb2}, {help pgb2}, {help smfit}, 
{help dagumfit}, {help lognfit}, if installed.