-------------------------------------------------------------------------------
help for paretofit          Stephen P. Jenkins & Philippe Van Kerm (April 2007)
-------------------------------------------------------------------------------

Fitting a Pareto (Type I) distribution by ML to unit record data

paretofit var [weight] [if exp] [in range] [, x0(scalar) stats avar(varlist) poorfrac(#) cdf(cdfname) pdf(pdfname) robust cluster(varname) from(#) level(#) nolog maximize_options]

by and svy prefixes are allowed (but not jointly); see prefix.

fweights, aweights, pweights, and iweights are allowed; see weight.

Description

paretofit fits a Pareto (Type I) distribution by ML to sample observations on a random variable var. Unit record data are assumed (rather than grouped data).

The Pareto distribution is named after the Italian economist Vilfredo Pareto (1848-1923). It is one of the best-known and most widely studied statistical size distributions. It is well known for approximating wealth distributions, but has applications in many different fields (e.g. for size distributions of human settlements, sand particles, word frequencies, or for assessing portfolio risk). See Kleiber and Kotz (2003) for a comprehensive review of the Pareto (and other) distributions.

The likelihood function for a sample of observations on var is specified as the product of the densities for each observation (weighted where relevant), and is maximized using ml model lf. A closed-form expression for the ML estimator of the Pareto Type I shape parameter is readily available, but estimation with ml model makes it easy to accommodate various sample designs and to include covariates.
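For reference, when there are no covariates and x0 is known, the closed-form estimator is a_hat = n / sum_i ln(x_i/x0), or sum_i w_i / sum_i w_i*ln(x_i/x0) with weights w_i. The sketch below illustrates that formula in Python rather than Stata; it is not part of paretofit, and the function name pareto_shape_mle and the simulated data are assumptions made for the example.

```python
import math
import random

def pareto_shape_mle(x, x0, weights=None):
    """Closed-form ML estimate of the Pareto Type I shape parameter a,
    for a known scale x0. Observations below x0 are dropped."""
    if weights is None:
        weights = [1.0] * len(x)
    kept = [(xi, wi) for xi, wi in zip(x, weights) if xi >= x0]
    sum_w = sum(wi for _, wi in kept)
    sum_wlog = sum(wi * math.log(xi / x0) for xi, wi in kept)
    return sum_w / sum_wlog

# Simulate Pareto draws by inverting the c.d.f.: x = x0 * (1 - u)^(-1/a)
rng = random.Random(12345)
true_a, x0 = 2.5, 20.0
sample = [x0 * (1.0 - rng.random()) ** (-1.0 / true_a) for _ in range(50000)]

a_hat = pareto_shape_mle(sample, x0)
print(round(a_hat, 2))  # should be close to true_a
```

With a sample of this size the estimate should sit close to the value used in the simulation; the iterative ml route described above is what extends this to covariates and complex sample designs.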

Options

x0(scalar) specifies the scale parameter of the Pareto distribution (see formula below). The Pareto distribution is fitted only to sample observations where var >= x0. By default, x0 is set to the minimum value of var (within the sub-sample identified by if and in clauses).

avar(varlist) allows the user to specify the shape parameter of the distribution as a function of the covariates specified in varlist. A constant term is always included.

stats displays selected distributional statistics implied by the Pareto parameter estimate: quantiles, cumulative shares of total var at quantiles (i.e. the Lorenz curve ordinates), the mode, mean, standard deviation, variance, half the coefficient of variation squared, Gini coefficient, and quantile ratios p90/p10, p75/p25. This option is not available together with avar(varlist).

poorfrac(#) displays the estimated proportion with values of var less than the cut-off specified by #. This option may be specified when replaying results.

cdf(cdfname) creates a new variable cdfname containing the estimated Pareto c.d.f. value F(x) for each x.

pdf(pdfname) creates a new variable pdfname containing the estimated Pareto p.d.f. value f(x) for each x.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.14 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). pweights imply robust.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. See [U] 23.14 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. Use the svy prefix for full complex survey design support. Specifying cluster() implies robust.

from(#) specifies a starting value for the maximum likelihood estimation.

level(#) specifies the confidence level, in percent, for the confidence intervals of the coefficients; see help level.

nolog suppresses the iteration log.

maximize_options control the maximization process. The options available are those shown by maximize. If you are seeing many "(not concave)" messages in the iteration log, using the difficult or technique options may help convergence.

Saved results

In addition to the usual results saved after ml, paretofit saves the following, if no covariates have been specified and the relevant options are used:

e(ba) is the estimated Pareto Type I shape parameter.

e(cdfvar) and e(pdfvar) are the variable names specified for the c.d.f. and the p.d.f.

e(mode), e(mean), e(var), e(sd), e(i2), and e(gini) are the estimated mode, mean, variance, standard deviation, half the squared coefficient of variation, and Gini coefficient.

e(pX) and e(LpX) are the quantiles and Lorenz ordinates, where X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}.

Formulae

The Pareto (Type I) distribution has cumulative distribution function (c.d.f.)

F(x) = 1 - { x0 / x }^a

where a>0 is a shape parameter (estimated by paretofit), x0 is a scale parameter, and x >= x0 > 0 is a value of the random variable. The smaller a is, the heavier the right tail of the distribution.

The probability density function (p.d.f.) is

f(x) = a*(x0^a) / x^(a+1).
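As an illustration of the two formulae, the following Python sketch (not Stata, and not part of paretofit) evaluates F(x) and f(x) for given a and x0. The function names are hypothetical, and returning 0 for x < x0 is an assumption of this sketch; the help text does not say how paretofit treats values below x0 in cdf(), pdf(), or poorfrac().

```python
def pareto_cdf(x, a, x0):
    """F(x) = 1 - (x0/x)^a for x >= x0; taken to be 0 below x0."""
    return 1.0 - (x0 / x) ** a if x >= x0 else 0.0

def pareto_pdf(x, a, x0):
    """f(x) = a * x0^a / x^(a+1) for x >= x0; taken to be 0 below x0."""
    return a * x0 ** a / x ** (a + 1) if x >= x0 else 0.0

a, x0 = 2.0, 50.0
below_100 = pareto_cdf(100.0, a, x0)    # proportion with x below 100
density_100 = pareto_pdf(100.0, a, x0)  # density at x = 100
print(below_100, density_100)
```

Evaluating the c.d.f. at a cut-off in this way corresponds to the kind of proportion that poorfrac(#) reports.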

The formulae used to derive the distributional summary statistics presented (optionally) are as follows. The r-th moment about the origin is given by

a*(x0^r) / (a-r)

which exists only if r<a (Kleiber and Kotz, 2003, p. 70). It follows that

mean = a*x0 / (a-1)

variance = a*(x0^2) / [ (a-2)*(a-1)^2 ]

from which the standard deviation and half the squared coefficient of variation can be derived. The variance and these two statistics are defined only where a>2 (the mean requires only a>1). The density is decreasing, so the mode is simply

mode = x0.
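The summary statistics can be checked directly from the r-th moment formula, since the variance is the second moment minus the squared mean. The Python sketch below (an illustration outside Stata, not the code paretofit runs) does this; the function names are hypothetical, and the dictionary keys merely mirror the e() names listed under Saved results.

```python
import math

def raw_moment(r, a, x0):
    """r-th moment about the origin, a*x0^r/(a-r); exists only if r < a."""
    if r >= a:
        raise ValueError("moment of order r exists only if r < a")
    return a * x0 ** r / (a - r)

def summary_stats(a, x0):
    mean = raw_moment(1, a, x0)
    variance = raw_moment(2, a, x0) - mean ** 2   # requires a > 2
    return {
        "mode": x0,
        "mean": mean,
        "var": variance,
        "sd": math.sqrt(variance),
        "i2": variance / (2.0 * mean ** 2),       # half the squared CV
    }

stats = summary_stats(a=3.0, x0=10.0)
print(stats)
```

Calling summary_stats with a <= 2 raises an error, which reflects the condition that the variance-based statistics do not exist in that case.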

The quantiles are derived by inverting the distribution function:

x_s = x0*(1-s)^(-1/a), for each 0 < s = F(x_s) < 1.

The median is therefore

median = x0*(2^(1/a)).
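A short Python sketch of the quantile function follows (an illustration only, with a hypothetical function name). It also computes the quantile ratios p90/p10 and p75/p25, which depend on a but not on x0 because x0 cancels in the ratio.

```python
def pareto_quantile(s, a, x0):
    """x_s = x0 * (1 - s)^(-1/a), for 0 < s < 1."""
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie strictly between 0 and 1")
    return x0 * (1.0 - s) ** (-1.0 / a)

a, x0 = 2.0, 50.0
median = pareto_quantile(0.5, a, x0)
p90_p10 = pareto_quantile(0.9, a, x0) / pareto_quantile(0.1, a, x0)
p75_p25 = pareto_quantile(0.75, a, x0) / pareto_quantile(0.25, a, x0)
print(median, p90_p10, p75_p25)
```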

The Gini coefficient of inequality is given by

Gini = 1 / (2a - 1).

The Lorenz curve ordinates at each s = F(x_s) are given by

L(s) = 1 - (1 - s)^(1 - 1/a).
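The Gini and Lorenz formulae are linked by the standard identity Gini = 1 - 2 * (area under the Lorenz curve). The Python sketch below (an illustration outside Stata, with hypothetical function names) evaluates L(s) = 1 - (1 - s)^(1 - 1/a) and checks the closed-form Gini against a midpoint-rule approximation of that area. It assumes a > 1, so that the mean exists.

```python
def lorenz(s, a):
    """Lorenz ordinate L(s) = 1 - (1 - s)^(1 - 1/a), assuming a > 1."""
    return 1.0 - (1.0 - s) ** (1.0 - 1.0 / a)

def gini(a):
    """Closed-form Gini coefficient, 1 / (2a - 1)."""
    return 1.0 / (2.0 * a - 1.0)

def gini_from_lorenz(a, steps=100000):
    """Numerical check: 1 - 2 * integral of L(s) over [0, 1] (midpoint rule)."""
    h = 1.0 / steps
    area = sum(lorenz((i + 0.5) * h, a) for i in range(steps)) * h
    return 1.0 - 2.0 * area

a = 2.0
print(gini(a), gini_from_lorenz(a))
print(lorenz(0.9, a))  # income share of the bottom 90 percent
```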

Examples

. paretofit x

. paretofit x [fw=wgt]

. paretofit

. paretofit x [aw=wgt] , x0(20)

. paretofit, stats poorfrac(100) x0(50)

. paretofit x, avar(age sex) x0(50)

Authors

Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

Philippe Van Kerm <philippe.vankerm@ceps.lu>, CEPS/INSTEAD, Differdange, Luxembourg.

Reference

Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Hoboken, NJ: John Wiley.

Also see

Online: help for smfit, dagumfit, gb2fit, lognfit, hillp, if installed.