-------------------------------------------------------------------------------
help for paretofit          Stephen P. Jenkins & Philippe Van Kerm (April 2007)
-------------------------------------------------------------------------------

Fitting a Pareto (Type I) distribution by ML to unit record data

        paretofit var [weight] [if exp] [in range] [, x0(#) stats
                 avar(varlist) poorfrac(#) cdf(cdfname) pdf(pdfname) robust
                 cluster(varname) from(#) level(#) nolog maximize_options ]

    by and svy prefixes are allowed (but not jointly); see prefix.

    fweights, aweights, pweights, and iweights are allowed; see weight.


Description

    paretofit fits by ML a Pareto (Type I) distribution to sample
    observations on a random variable var. Unit record data are assumed
    (rather than grouped data).

    The Pareto distribution is named after the Italian economist Vilfredo
    Pareto (1848-1923). It is one of the most famous and widely studied
    statistical size distributions.  It is well-known for approximating
    wealth distributions, but has applications in many different fields (e.g.
    for size distributions of human settlements, sand particles, word
    frequencies, or for assessing portfolio risk). See Kleiber and Kotz
    (2003) for a comprehensive review of the Pareto (and other)
    distributions.

    The likelihood function for a sample of observations on var is specified
    as the product of the densities for each observation (weighted where
    relevant), and is maximized using ml model lf. A closed-form expression
    for the ML estimator of the Pareto Type I shape parameter is readily
    available, but estimation with ml model makes it easy to accommodate
    various sample designs, as well as the inclusion of covariates.
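
    As a rough cross-check on the ML output, the closed-form estimator
    mentioned above can be computed by hand. The sketch below assumes
    unweighted data, no covariates, a strictly positive variable named x, and
    x0 set to the sample minimum; the names xmin and lnratio are illustrative
    only. It uses the standard result a = n / sum(ln(x_i/x0)) for the Pareto
    Type I shape parameter when x0 is treated as known.

```stata
* Closed-form ML estimate of the shape parameter a
* (unweighted, no covariates; x0 taken as the sample minimum of x).
quietly summarize x, meanonly
scalar xmin = r(min)
generate double lnratio = ln(x/scalar(xmin))
quietly summarize lnratio, meanonly
* a = n / sum(ln(x/x0)) = 1 / mean(ln(x/x0))
display "closed-form a = " 1/r(mean)
```

    The result should be close to the estimate reported by paretofit x on
    the same sample.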


Options

    x0(scalar) specifies the scale parameter of the Pareto distribution (see
        formula below). The Pareto distribution is fitted only to sample
        observations where var>=x0.  By default, x0 is set to the minimum
        value of var (within the sub-sample identified by if and in clauses).

    avar(varlist) allows the user to specify the shape parameter of the
        distribution as a function of the covariates specified in varlist. A
        constant term is always included.

    stats displays selected distributional statistics implied by the Pareto
        parameter estimate:  quantiles, cumulative shares of total var at
        quantiles (i.e. the Lorenz curve ordinates), the mode, mean, standard
        deviation, variance, half the coefficient of variation squared, Gini
        coefficient, and quantile ratios p90/p10, p75/p25. This option is not
        available together with avar(varlist).

    poorfrac(#) displays the estimated proportion with values of var less
        than the cut-off specified by #. This option may be specified when
        replaying results.

    cdf(cdfname) creates a new variable cdfname containing the estimated
        Pareto c.d.f. value F(x) for each x.

    pdf(pdfname) creates a new variable pdfname containing the estimated
        Pareto p.d.f. value f(x) for each x.


    robust specifies that the Huber/White/sandwich estimator of variance is
        to be used in place of the traditional calculation; see [U] 23.14
        Obtaining robust variance estimates.  robust combined with cluster()
        allows for observations that are not independent within clusters
        (although they must be independent between clusters).  pweights imply
        robust.

    cluster(varname) specifies that the observations are independent across
        groups (clusters) but not necessarily within groups.  varname
        specifies to which group each observation belongs; e.g.,
        cluster(personid) in data with repeated observations on individuals.
        See [U] 23.14 Obtaining robust variance estimates. cluster() can be
        used with pweights to produce estimates for unstratified
        cluster-sampled data. Use the svy prefix for full complex survey
        design support. Specifying cluster() implies robust.

    from(#) specifies a starting value for the maximum likelihood estimation.

    level(#) specifies the confidence level, in percent, for the confidence
        intervals of the coefficients; see help level.

    nolog suppresses the iteration log.

    maximize_options control the maximization process. The options available
        are those shown by maximize. If you are seeing many "(not concave)"
        messages in the iteration log, using the difficult or technique
        options may help convergence.


Saved results

    In addition to the usual results saved after ml, paretofit saves the
    following, if no covariates have been specified and the relevant options
    are used:

    e(ba) is the estimated Pareto Type I shape parameter.

    e(cdfvar) and e(pdfvar) are the variable names specified for the c.d.f.
    and the p.d.f.

    e(mode), e(mean), e(var), e(sd), e(i2), and e(gini) are the estimated
    mode, mean, variance, standard deviation, half the squared coefficient of
    variation, and Gini coefficient. e(pX) and e(LpX) are the quantiles and
    Lorenz ordinates, respectively, where X = {1, 5, 10, 20, 25, 30, 40, 50,
    60, 70, 75, 80, 90, 95, 99}.


Formulae

    The Pareto (Type I) distribution has cumulative distribution function
    (c.d.f.)

        F(x) = 1 - { x0 / x }^a

    where a>0 is a shape parameter (estimated by paretofit), x0 is a scale
    parameter, and x >= x0 > 0 is a random variable.  The smaller a is, the
    heavier the right tail of the Pareto distribution.

    The probability density function (p.d.f.) is

        f(x) = a*(x0^a) / x^(a+1).
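
    As a small worked example of the two formulae, the sketch below evaluates
    the c.d.f. and p.d.f. at x = 20 for the purely illustrative values a = 3
    and x0 = 10 (these are assumptions, not estimates from any data set). By
    hand, F(20) = 1 - (1/2)^3 = 0.875 and f(20) = 3*1000/160000 = 0.01875.

```stata
* Illustrative values only: a = 3, x0 = 10, evaluated at x = 20.
scalar a  = 3
scalar x0 = 10
display "F(20) = " 1 - (scalar(x0)/20)^scalar(a)
display "f(20) = " scalar(a)*scalar(x0)^scalar(a)/20^(scalar(a) + 1)
```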

    The formulae used to derive the distributional summary statistics
    presented (optionally) are as follows. The r-th moment about the origin
    is given by

        a*(x0^r) / (a-r)

    which exists only if r<a (Kleiber and Kotz, 2003, p. 70).  It follows
    that

        mean = a*x0 / (a-1)

        variance = a*(x0^2) / [ (a-2)*(a-1)^2 ]

    from which the standard deviation and half the squared coefficient of
    variation can be derived. The mean is defined only where a>1; the
    variance and the two statistics derived from it are defined only where
    a>2. The density is decreasing, so the mode is simply

        mode = x0.
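
    The moment formulae can be checked numerically. The sketch below uses the
    illustrative values a = 3 and x0 = 10 (assumptions chosen so that a>2)
    and the standard Pareto Type I variance a*x0^2 / [(a-2)*(a-1)^2]. By
    hand, this gives a mean of 15, a variance of 75, a standard deviation of
    about 8.66, and half the squared coefficient of variation of about 0.167.

```stata
* Illustrative values only: a = 3, x0 = 10 (a > 2, so all moments below exist).
scalar a  = 3
scalar x0 = 10
scalar m  = scalar(a)*scalar(x0)/(scalar(a) - 1)
scalar v  = scalar(a)*scalar(x0)^2/((scalar(a) - 2)*(scalar(a) - 1)^2)
display "mean      = " scalar(m)
display "variance  = " scalar(v)
display "std. dev. = " sqrt(scalar(v))
display "half CV^2 = " scalar(v)/(2*scalar(m)^2)
```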

    The quantiles are derived by inverting the distribution function:

        x_s = x0*(1-s)^(-1/a), for each 0 < s = F(x_s) < 1.

    The median is therefore

        median = x0*(2^(1/a)).
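
    Because the quantile function has a closed form, a Pareto (Type I) sample
    can be simulated by inverse-transform sampling and then refitted as a
    sanity check. The sketch below is illustrative only: the true values a =
    3 and x0 = 10, the seed, the sample size, and the variable names u and x
    are all assumptions, and paretofit must already be installed.

```stata
* Simulate a Pareto (Type I) sample by inverse-transform sampling, then refit.
* Illustrative true values: a = 3, x0 = 10.
clear
set seed 20070401
set obs 10000
generate double u = runiform()
* x = x0*(1-u)^(-1/a), i.e. the quantile formula applied to uniform draws
generate double x = 10*(1 - u)^(-1/3)
paretofit x, x0(10)
* Theoretical median, for comparison with the 50th percentile reported below.
display "theoretical median = " 10*2^(1/3)
summarize x, detail
```

    The estimated shape parameter should be close to 3, and the sample median
    close to the theoretical value of about 12.6.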

    The Gini coefficient of inequality is given by

        Gini = 1 / (2a - 1).

    The Lorenz curve ordinates at each s = F(x_s) are given by

        L(s) = 1 - (1 - s)^(1 - 1/a).
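
    As a worked example of the inequality formulae, the sketch below uses the
    illustrative value a = 3 (an assumption, not an estimate). By hand, the
    Gini coefficient is 1/5 = 0.2, and the Lorenz ordinates are about 0.370
    at s = 0.5 and about 0.785 at s = 0.9; that is, the poorest half would
    hold roughly 37 percent of the total.

```stata
* Illustrative value only: a = 3 (neither formula involves x0).
scalar a = 3
display "Gini   = " 1/(2*scalar(a) - 1)
display "L(0.5) = " 1 - (1 - 0.5)^(1 - 1/scalar(a))
display "L(0.9) = " 1 - (1 - 0.9)^(1 - 1/scalar(a))
```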

        
Examples

    . paretofit x

    . paretofit x [fw=wgt]

    . paretofit

    . paretofit x [aw=wgt] , x0(20)

    . paretofit, stats poorfrac(100) x0(50)

    . paretofit x, avar(age sex) x0(50)


Authors

    Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and
    Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

    Philippe Van Kerm <philippe.vankerm@ceps.lu>, CEPS/INSTEAD, Differdange,
    Luxembourg.


Reference

    Kleiber, C. and Kotz, S. (2003).  Statistical Size Distributions in
        Economics and Actuarial Sciences.  Hoboken, NJ: John Wiley.


Also see

    Online: help for smfit, dagumfit, gb2fit, lognfit, hillp if installed.