-------------------------------------------------------------------------------
help for gb2fit                               Stephen P. Jenkins (March 2007)
-------------------------------------------------------------------------------

Fitting a Generalized Beta (Second Kind) distribution by ML to unit record data

gb2fit var [weight] [if exp] [in range] [, avar(varlist1) bvar(varlist2) pvar(varlist3) qvar(varlist4) abpq(varlist) stats from(string) poorfrac(#) cdf(cdfname) pdf(pdfname) robust cluster(varname) svy level(#) maximize_options svy_options]

by ... : may be used with gb2fit; see help by.

pweights, aweights, fweights, and iweights are allowed; see help weights. To use pweights, you must first svyset your data and then use the svy option.

Description

gb2fit fits by ML the 4 parameter Generalized Beta distribution of the second kind (GB2) to sample observations on a random variable var. Unit record data are assumed (rather than grouped data). The GB2 distribution is also known as the Generalized F distribution (differently parameterized) or the Feller-Pareto distribution. The Singh-Maddala (1976) distribution is the special case when parameter p = 1, and the Dagum (1977, 1980) distribution is the special case when parameter q = 1. For a comprehensive review of these and other related distributions, see Kleiber and Kotz (2003). The GB2 distribution has been shown to provide a good fit to data on income (see e.g. McDonald, 1984) but, of course, it might also be suitable for describing any skewed variable, not only income.

The likelihood function for a sample of observations on var is specified as the product of the densities for each observation (weighted where relevant), and is maximized using ml model lf.

Options

avar(varlist1), bvar(varlist2), pvar(varlist3), and qvar(varlist4) allow the user to specify each parameter as a function of the covariates specified in the respective variable list. A constant term is always included in each equation.

abpq(varlist) can be used instead of the previous four options if the same covariates are to appear in each parameter equation.

from(string) specifies initial values for the GB2 parameters, and is likely to be used only rarely. You can specify the initial values in one of three ways: by giving the name of a vector containing the initial values (e.g., from(b0) where b0 is a properly labeled vector); by specifying coefficient names with the values (e.g., from(a:_cons=1 b:_cons=5 p:_cons=0 q:_cons=.16)); or by specifying an ordered list of values (e.g., from(1 5 0 .16, copy)). Poor values in from() may lead to convergence problems. For more details, including the use of copy and skip, see help maximize.

If covariates are specified, the next four options are not available. Use gb2pred to generate statistics at particular values of the covariates, or nlcom. predict can be used to generate the observation-specific parameters corresponding to the covariate values of each sample observation: see Examples below.

stats displays selected distributional statistics implied by the GB2 parameter estimates: quantiles, cumulative shares of total var at quantiles (i.e. the Lorenz curve ordinates), the mode, mean, standard deviation, variance, and half the coefficient of variation squared.

poorfrac(#) displays the estimated proportion with values of var less than the cut-off specified by #. This option may be specified when replaying results.

cdf(cdfname) creates a new variable cdfname containing the estimated GB2 c.d.f. value F(x) for each x.

pdf(pdfname) creates a new variable pdfname containing the estimated GB2 p.d.f. value f(x) for each x.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.14 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. See [U] 23.14 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. Specifying cluster() implies robust.

svy indicates that ml is to pick up the svy settings set by svyset and use the robust variance estimator. Thus, this option requires the data to be svyset; see help svyset. svy may not be combined with weights or the strata(), psu(), fpc(), or cluster() options.

level(#) specifies the confidence level, in percent, for the confidence intervals of the coefficients; see help level.

nolog suppresses the iteration log.

maximize_options control the maximization process. The options available are those shown by maximize, with the exception of from(). If you are seeing many "(not concave)" messages in the iteration log, using the difficult or technique options may help convergence.

svy_options specify the options used together with the svy option.

Saved results

In addition to the usual results saved after ml, gb2fit also saves the following, if no covariates have been specified:

e(ba), e(bb), e(bp), and e(bq) are the estimated GB2 parameters.

e(cdfvar) and e(pdfvar) are the variable names specified for the c.d.f. and the p.d.f.

e(mode), e(mean), e(var), e(sd), and e(i2) are the estimated mode, mean, variance, standard deviation, and half coefficient of variation squared.

e(pX) and e(LpX) are the quantiles and Lorenz ordinates, where X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}.

The following results are saved regardless of whether covariates have been specified or not.

e(b_a), e(b_b), e(b_p), and e(b_q) are row vectors containing the parameter estimates from each equation.

e(length_b_a), e(length_b_b), and e(length_b_q) contain the lengths of these vectors. If no covariates have been specified in an equation, the corresponding vector has length equal to 1 (the constant term); otherwise, the length is one plus the number of covariates.

Formulae

The GB2 distribution has distribution function (c.d.f.)

F(x) = ibeta(p, q, (x/b)^a/(1+(x/b)^a))

where a, b, p, q are parameters, each positive, for random variable x > 0. Parameters a, p, and q are the key distributional 'shape' parameters; b is a scale parameter.

The GB2 distribution has density

f(x) = a*x^(a*p-1) / {b^(a*p)*B(p,q)*[1 + (x/b)^a]^(p+q)}.
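The two formulae above can be evaluated by hand with Stata's built-in ibeta() and lngamma() functions. The following is a minimal sketch, not the code used inside gb2fit: the parameter values, the evaluation point, and the scalar names (s_a, s_b, and so on) are all hypothetical and chosen only for illustration.

```stata
* Hypothetical parameter values and evaluation point, for illustration only
scalar s_a = 3
scalar s_b = 100
scalar s_p = 2
scalar s_q = 1.5
scalar s_x = 120

* y = (x/b)^a and z = y/(1+y), the argument of the incomplete beta function
scalar s_y = (s_x/s_b)^s_a
scalar s_z = s_y/(1 + s_y)

* c.d.f.: ibeta(p,q,z) is the regularized incomplete beta function
scalar s_F = ibeta(s_p, s_q, s_z)

* log of the Beta function B(p,q), built from lngamma()
scalar s_lnB = lngamma(s_p) + lngamma(s_q) - lngamma(s_p + s_q)

* p.d.f., assembled on the log scale and then exponentiated
scalar s_lnf = ln(s_a) + (s_a*s_p - 1)*ln(s_x) - s_a*s_p*ln(s_b)
scalar s_lnf = s_lnf - s_lnB - (s_p + s_q)*ln(1 + s_y)
scalar s_f = exp(s_lnf)

display "F(x) = " s_F "   f(x) = " s_f
```

Evaluating F() at a cut-off in this way corresponds to the proportion that poorfrac(#) reports, given the same parameter values.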

The formulae used to derive the distributional summary statistics presented (optionally) are as follows. The r-th moment about the origin is given by

b^r*B(p+r/a,q-r/a)/B(p,q)

where B(u,v) is the Beta function, B(u,v) = G(u)*G(v)/G(u+v), and G(.) is the gamma function [exp(lngamma(.))]. The moments exist for -ap < r < aq. By substitution (the G(p+q) terms cancel), the moments can be written

b^r*G(p+r/a)*G(q-r/a)/[G(p)G(q)]

and hence

mean = b*G(p+1/a)*G(q-1/a)/[G(p)G(q)]

variance = b*b*G(p+2/a)*G(q-2/a)/[G(p)G(q)] - (mean^2)

from which the standard deviation and half the squared coefficient of variation can be derived. The mode is

mode = b*((ap-1)/(aq+1))^(1/a) if ap > 1, and 0 otherwise.
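As a worked illustration of the moment and mode formulae, the sketch below computes the mean, variance, standard deviation, half the squared coefficient of variation, and mode from a set of parameter values. The values and the scalar names are hypothetical; after fitting a model without covariates, the estimates in e(ba), e(bb), e(bp), and e(bq) could be substituted instead.

```stata
* Hypothetical parameter values, for illustration only.
* The variance requires a*q > 2, which holds here.
scalar s_a = 3
scalar s_b = 100
scalar s_p = 1
scalar s_q = 1

* log of G(p)*G(q), the common denominator of the moments
scalar s_lnden = lngamma(s_p) + lngamma(s_q)

* first and second moments about the origin
scalar s_m1 = s_b*exp(lngamma(s_p + 1/s_a) + lngamma(s_q - 1/s_a) - s_lnden)
scalar s_m2 = s_b^2*exp(lngamma(s_p + 2/s_a) + lngamma(s_q - 2/s_a) - s_lnden)

* summary statistics derived from the moments
scalar s_var = s_m2 - s_m1^2
scalar s_sd = sqrt(s_var)
scalar s_i2 = s_var/(2*s_m1^2)

* mode: positive only if a*p > 1, and 0 otherwise
scalar s_mode = cond(s_a*s_p > 1, s_b*((s_a*s_p - 1)/(s_a*s_q + 1))^(1/s_a), 0)

display "mean = " s_m1 "  var = " s_var "  sd = " s_sd
display "i2 = " s_i2 "  mode = " s_mode
```

Working with lngamma() and exponentiating at the end keeps the gamma-function ratios from overflowing when the arguments are large.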

The quantiles are derived by inverting the distribution function, and calculation of the Lorenz ordinates exploits the fact that they follow a GB2 distribution. (See Kleiber and Kotz, 2003, eqn (6.23).) The Gini coefficient is not calculated as this requires evaluation of the generalized hypergeometric function 3F2, and this is not currently available in Stata.
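The inversion can be sketched with Stata's invibeta() function. This is an illustrative calculation under stated assumptions, not necessarily how gb2fit computes its results: the parameter values and scalar names are hypothetical, and the Lorenz ordinate uses the standard result that the first-moment distribution of a GB2(a, b, p, q) variable is GB2(a, b, p + 1/a, q - 1/a), which requires aq > 1.

```stata
* Hypothetical parameter values, for illustration only (a*q > 1 holds)
scalar s_a = 3
scalar s_b = 100
scalar s_p = 2
scalar s_q = 1.5

* population share at which to evaluate; 0.5 gives the median
scalar s_u = 0.5

* invert the incomplete beta function to get z, then undo
* z = y/(1+y) with y = (x/b)^a to recover the quantile x
scalar s_z = invibeta(s_p, s_q, s_u)
scalar s_xu = s_b*(s_z/(1 - s_z))^(1/s_a)

* Lorenz ordinate: c.d.f. of GB2(a, b, p + 1/a, q - 1/a) at the same
* quantile, and hence at the same z
scalar s_Lu = ibeta(s_p + 1/s_a, s_q - 1/s_a, s_z)

display "quantile = " s_xu "   Lorenz ordinate = " s_Lu
```

Because a and b are unchanged in the first-moment distribution, the value of z found for the quantile can be reused directly for the Lorenz ordinate.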

Examples

. gb2fit x [w=wgt]

. gb2fit

. gb2fit x, a(age sex) b(age sex) p(age sex) q(age sex)

. gb2fit x, abpq(age sex)

. predict double a_i, eq(a) xb

. predict double b_i, eq(b) xb

. predict double p_i, eq(p) xb

. predict double q_i, eq(q) xb

See also the examples in the presentation by Jenkins (2004).

Author

Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

Acknowledgements

N.J. Cox made numerous helpful comments and suggestions, and also wrote programs for distributional diagnostic plots (qgb2, pgb2).

References

Dagum, C. (1977). A new model of personal income distribution: specification and estimation. Economie Appliquée 30: 413-437.

Dagum, C. (1980). The generation and distribution of income, the Lorenz curve and the Gini ratio. Economie Appliquée 33: 327-367.

Jenkins, S.P. (2004). Fitting functional forms to distributions, using ml. Presentation at Second German Stata Users Group Meeting, Berlin. http://www.stata.com/meeting/2german/Jenkins.pdf

Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Hoboken, NJ: John Wiley.

McDonald, J.B. (1984). Some generalized functions for the size distribution of income. Econometrica 52: 647-663.

Singh, S.K. and G.S. Maddala (1976). A function for the size distribution of income. Econometrica 44: 963-970.

Also see

Online: help for gb2pred, qgb2, pgb2, smfit, dagumfit, lognfit, if installed.