Fitting a lognormal distribution by ML to unit record data
lognfit var [weight] [if exp] [in range] [, mvar(varlist1) vvar(varlist2) mandv(varlist) stats from(string) poorfrac(#) cdf(cdfname) pdf(pdfname) robust cluster(varname) svy level(#) maximize_options svy_options ]
by ... : may be used with lognfit; see help by.
pweights, aweights, fweights, and iweights are allowed; see help weights. To use pweights, you must first svyset your data and then use the svy option.
Description
lognfit fits by ML the two-parameter lognormal distribution to sample observations on a random variable var. Unit record data are assumed (rather than grouped data). For a comprehensive review of the lognormal distribution, see Aitchison and Brown (1957). See also Kleiber and Kotz (2003).
The likelihood function for a sample of observations on var is specified as the product of the densities for each observation (weighted where relevant), and is maximized using ml model lf.
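A minimal sketch of how a likelihood of this kind could be set up with ml model lf is given below. The evaluator name mylogn_lf is hypothetical and is not the program that lognfit uses internally. The sketch assumes a strictly positive variable x, no covariates, and that v enters its equation directly rather than in logs. normalden() is the Stata 9 (and later) name for the standard normal density.

```stata
* Hypothetical lf evaluator for the two-parameter lognormal (illustration only)
program define mylogn_lf
    version 9
    args lnf m v
    * log f(x) = ln(phi(z)) - ln(v) - ln(x), with z = (ln(x) - m)/v
    quietly replace `lnf' = ln(normalden((ln($ML_y1) - `m')/`v')) ///
        - ln(`v') - ln($ML_y1)
end

ml model lf mylogn_lf (m: x = ) (v: )
ml init m:_cons=0 v:_cons=1      // start from a feasible point (v > 0)
ml maximize
```

Weights and the if and in qualifiers would be passed on the ml model line in the usual way.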
Options
mvar(varlist1) and vvar(varlist2) allow the user to specify each parameter as a function of the covariates specified in the respective variable list. A constant term is always included in each equation.
mandv(varlist) can be used instead of the previous two options if the same covariates are to appear in each parameter equation.
from(string) specifies initial values for the parameters, and is likely to be used only rarely. You can specify the initial values in one of three ways: by giving the name of a vector containing the initial values (e.g., from(b0), where b0 is a properly labeled vector); by specifying coefficient names with the values (e.g., from(m:_cons=1 v:_cons=5)); or by specifying an ordered list of values (e.g., from(1 5, copy)). Poor values in from() may lead to convergence problems. For more details, including the use of copy and skip, see help maximize.
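The three forms of from() can be sketched as follows, for a model without covariates. The variable x and the vector name b0 are illustrative; the equation names m and v follow the coefficient names used above.

```stata
* (1) a properly labeled row vector of starting values
matrix b0 = (1, 5)
matrix colnames b0 = m:_cons v:_cons
lognfit x, from(b0)

* (2) coefficient names with values
lognfit x, from(m:_cons=1 v:_cons=5)

* (3) an ordered list of values, matched by position
lognfit x, from(1 5, copy)
```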
If covariates are specified, the next four options (stats, poorfrac(), cdf(), and pdf()) are not available. Use lognpred or nlcom to generate statistics at particular values of the covariates. predict can be used to generate the observation-specific parameters corresponding to the covariate values of each sample observation: see Examples below.
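As a sketch of the nlcom route, the implied mean at chosen covariate values can be computed from the formula mean = exp(m + .5*(v^2)). The covariates age and sex and the values 40 and 1 are illustrative, and the sketch assumes that each equation is linear in its parameter (m and v themselves, not their logs).

```stata
lognfit x, mandv(age sex)

* Implied mean of x at age = 40, sex = 1, with a delta-method standard error
nlcom (mean: exp([m]_b[_cons] + [m]_b[age]*40 + [m]_b[sex]*1    ///
      + 0.5*([v]_b[_cons] + [v]_b[age]*40 + [v]_b[sex]*1)^2))
```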
stats displays selected distributional statistics implied by the lognormal parameter estimates: quantiles, cumulative shares of total var at quantiles (i.e. the Lorenz curve ordinates), the mode, mean, standard deviation, variance, half the coefficient of variation squared, Gini coefficient, and quantile ratios p90/p10, p75/p25.
poorfrac(#) displays the estimated proportion with values of var less than the cut-off specified by #. This option may be specified when replaying results.
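For example, with an illustrative cut-off of 100 and no covariates, the proportion can be requested at estimation or on replay, and reproduced by hand from the coefficients using the standard lognormal c.d.f. normal() is the Stata 9 (and later) name for norm().

```stata
lognfit x, poorfrac(100)

* replay with a different cut-off
lognfit, poorfrac(50)

* by hand: F(100) evaluated at the estimated parameters
display normal((ln(100) - [m]_b[_cons]) / [v]_b[_cons])
```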
cdf(cdfname) creates a new variable cdfname containing the estimated lognormal c.d.f. value F(x) for each x.
pdf(pdfname) creates a new variable pdfname containing the estimated lognormal p.d.f. value f(x) for each x.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.14 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. See [U] 23.14 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. Specifying cluster() implies robust.
svy indicates that ml is to pick up the svy settings set by svyset and use the robust variance estimator. Thus, this option requires the data to be svyset; see help svyset. svy may not be combined with weights or the strata(), psu(), fpc(), or cluster() options.
level(#) specifies the confidence level, in percent, for the confidence intervals of the coefficients; see help level.
nolog suppresses the iteration log.
maximize_options control the maximization process. The options available are those shown by maximize, with the exception of from(). If you are seeing many "(not concave)" messages in the iteration log, using the difficult or technique options may help convergence.
svy_options specify the options used together with the svy option.
Saved results
In addition to the usual results saved after ml, lognfit also saves the following, if no covariates have been specified and the relevant options have been used:
e(bm) and e(bv) are the estimated lognormal parameters.
e(cdfvar) and e(pdfvar) are the variable names specified for the c.d.f. and the p.d.f.
e(mode), e(mean), e(var), e(sd), e(i2), and e(gini) are the estimated mode, mean, variance, standard deviation, half the coefficient of variation squared, and Gini coefficient. e(pX) and e(LpX) are the quantiles and Lorenz ordinates, where X = {1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99}.
The following results are saved regardless of whether covariates have been specified or not.
e(b_m) and e(b_v) are row vectors containing the parameter estimates from each equation.
e(length_b_m) and e(length_b_v) contain the lengths of these vectors. If no covariates have been specified in an equation, the corresponding vector has length equal to 1 (the constant term); otherwise, the length is one plus the number of covariates.
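A short sketch of retrieving these results after fitting without covariates (the variable x is illustrative):

```stata
lognfit x, stats

* scalars saved when stats is specified
display "mean = " e(mean) "   Gini = " e(gini)
display "p90/p10 = " e(p90)/e(p10)

* results saved in all cases
matrix list e(b_m)
display e(length_b_m)
```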
Formulae
The lognormal distribution has distribution function (c.d.f.)
F(x) = N( (log(x) - m)/v )
where N(.) is the standard normal c.d.f., and m and v are parameters, with v > 0, for random variable x > 0.
The probability density function (p.d.f.) is
f(x) = (x*sqrt(2*_pi)*v)^(-1) * exp( -.5*(v^-2)*(log(x) - m)^2 ).
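As a check, the variables created by cdf() and pdf() can be compared with values computed by hand from the standard lognormal c.d.f. and p.d.f. The names Fx, fx, Fx_chk, fx_chk, m_hat, and v_hat are illustrative, and the sketch assumes a model without covariates.

```stata
lognfit x, cdf(Fx) pdf(fx)

scalar m_hat = [m]_b[_cons]
scalar v_hat = [v]_b[_cons]

generate double Fx_chk = normal((ln(x) - m_hat)/v_hat)
generate double fx_chk = exp(-0.5*((ln(x) - m_hat)/v_hat)^2) / (x*sqrt(2*_pi)*v_hat)

* the two versions of each variable should agree up to rounding
summarize Fx Fx_chk fx fx_chk
```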
The formulae used to derive the distributional summary statistics presented (optionally) are as follows. The r-th moment about the origin is given by
exp( r*m + .5*(r^2)*(v^2) )
and hence
mean = exp(m + .5*(v^2) )
variance = q*(q-1)*exp(2*m) where q = exp(v^2)
from which the standard deviation and half the squared coefficient of variation can be derived. The mode is
mode = exp(m - v^2).
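These expressions can be evaluated directly. The sketch below uses arbitrary illustrative values m = 1 and v = 0.5 rather than estimates; half the squared coefficient of variation simplifies to (q-1)/2 because variance/mean^2 = q - 1.

```stata
scalar m_hat = 1          // illustrative value, not an estimate
scalar v_hat = 0.5        // illustrative value, not an estimate
scalar q = exp(v_hat^2)

display "mean     = " exp(m_hat + 0.5*v_hat^2)
display "variance = " q*(q-1)*exp(2*m_hat)
display "sd       = " sqrt(q*(q-1)*exp(2*m_hat))
display "i2       = " 0.5*(q-1)
display "mode     = " exp(m_hat - v_hat^2)
```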
The quantiles are derived by inverting the distribution function:
x_s = exp( m + v*invnorm(s) ) for each s = F(x_s).
The Gini coefficient of inequality is given by
Gini = 2*norm(v/sqrt(2)) - 1 .
The Lorenz curve ordinates at each s = F(x_s) are
L(s) = norm(invnorm(s) - v).
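The quantile, Gini, and Lorenz expressions can be evaluated in the same way. The sketch uses arbitrary illustrative values m = 1 and v = 0.5, and normal() and invnormal(), the Stata 9 (and later) names for norm() and invnorm(). The Lorenz ordinate is the standard lognormal result, in which v (not its square) is subtracted.

```stata
scalar m_hat = 1          // illustrative value, not an estimate
scalar v_hat = 0.5        // illustrative value, not an estimate

display "median  = " exp(m_hat + v_hat*invnormal(0.5))
display "p90/p10 = " exp(v_hat*(invnormal(0.9) - invnormal(0.1)))
display "Gini    = " 2*normal(v_hat/sqrt(2)) - 1
display "L(0.5)  = " normal(invnormal(0.5) - v_hat)
```

Note that the quantile ratios and the Gini coefficient depend on v only.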
Examples
. lognfit x [w=wgt]
. lognfit
. lognfit x, stats poorfrac(100)
. lognfit x, mvar(age sex) vvar(age sex)
. lognfit x, mandv(age sex)
. predict double m_i, eq(m) xb
. predict double v_i, eq(v) xb
See also the examples provided in the presentation by Jenkins (2004).
Author
Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and Economic Research, University of Essex, Colchester CO4 3SQ, U.K.
Acknowledgements
N.J. Cox made numerous helpful comments and suggestions, and also wrote programs for distributional diagnostic plots (qlogn, plogn).
References
Aitchison, J. and Brown, J.A.C. (1957). The Lognormal Distribution. Cambridge: Cambridge University Press.
Jenkins, S.P. (2004). Fitting functional forms to distributions, using ml. Presentation at Second German Stata Users Group Meeting, Berlin. http://www.stata.com/meeting/2german/Jenkins.pdf
Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Hoboken, NJ: John Wiley.
Also see
Online: help for lognpred, plogn, qlogn, smfit, dagumfit, gb2fit, if installed.