-------------------------------------------------------------------------------
help for dpplot
-------------------------------------------------------------------------------

Density probability plots

dpplot varname [if exp] [in range] [, a(#) dist(name) param(numlist) generate(newvar1 newvar2) line(line_options) graph_options plot(plot)]

Description

dpplot plots density probability plots for varname given a reference distribution, by default normal (Gaussian).

Remarks

To establish notation, and to fix ideas with a concrete example: consider an observed variable Y, whose distribution we wish to compare with a normally distributed variable X. That variable has density function f(X), distribution function P = F(X) and quantile function X = Q(P). (The distribution function and the quantile function are inverses of each other.) Clearly, this notation is fairly general and also covers other distributions, at least for continuous variables.

The particular density function f(X | parameters) most pertinent to comparison with data for Y can be computed given values for its parameters, either estimates from data on Y, or parameter values chosen for some other good reason. In the case of a normal distribution, these parameters would usually be the mean and the standard deviation. Such density functions are often superimposed on histograms or other graphical displays. In Stata, histogram has a normal option which adds the normal density curve corresponding to the mean and standard deviation of the data shown.

The density function can also be computed indirectly via the quantile function as f(Q(P)). For example, if P were 0.5, then f(Q(0.5)) would be the density at the median. In practice P is calculated as so-called plotting positions p_i attached to values y_(i) of a sample of Y of size n which have rank i: that is, the y_(i) are the order statistics y_(1) <= ... <= y_(n). One simple rule uses p_i = (i - 0.5) / n. Most other rules follow one of a family (i - a) / (n - 2a + 1) indexed by a.

Plotting both f(X | parameters) and f(Q(P = p_i)), calculated using plotting positions, versus observed Y gives two curves. In our example, the first is normal by construction and the second would be a good estimate of a normal density if Y were truly normal with the same parameters. In terms of Stata functions, the two curves are based on normden((X - mean) / SD) and normden(invnorm(p_i)). The match or mismatch between the curves allows graphical assessment of goodness or badness of fit. What is more, we can use experience from comparing frequency distributions, as shown on histograms, dot plots or other similar displays, in comparing or identifying location and scale differences, skewness, tail weight, tied values, gaps, outliers and so forth.

Such density probability plots were suggested by Jones and Daly (1995). See also Jones (2004). They are best seen as special-purpose plots, like normal quantile plots and their kin, rather than general-purpose plots, like histograms or dot plots.

Extending the discussion in Jones and Daly (1995), the advantages (+) and limitations (-) of these plots include

+1. No choices of binning or origin (cf. histograms, dot plots, etc.) or of kernel or of degree of smoothing (cf. density estimation) are required.

+2. Some people find them easier to interpret than quantile-quantile plots.

+3. They work well for a wide range of sample sizes. At the same time, as with any other method, a sample of at least moderate size is preferable (one rule of thumb is >= 25).

+4. If X has bounded support in one or both directions, then this should be clear on the plot.

-1. Results may be difficult to decipher if observed and reference distributions differ in modality. For example, if the reference distribution is unimodal but the observed data hint at bimodality, nevertheless f(Q(P)) must be unimodal even though f(Y) may not be. Similarly, when the reference distribution is exponential, then f(Q(P)) must be monotone decreasing whatever the shape of f(Y).

-2. It may be difficult to discern subtle differences in one or both tails of the observed and reference distributions.

-3. Comparison is of a curve with a curve: some people argue that graphical references should where possible be linear (and ideally horizontal). (A linear reference is a clear advantage of quantile plots.)

-4. There is no simple extension to comparison of two samples with each other.
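
Limitation -1 can be seen in a small simulation. The sketch below is illustrative only and is not dpplot's own code. It assumes a recent Stata in which runiform(), rnormal(), normalden() and invnormal() are available; the seed, the sample size and the variable names y, p and f_indirect are arbitrary choices made for this example.

```stata
* Simulate a clearly bimodal sample: an equal mixture of N(-2, 1) and N(2, 1)
clear
set seed 2718
set obs 500
generate double y = cond(runiform() < 0.5, rnormal(-2, 1), rnormal(2, 1))

* Estimate the parameters of the normal reference distribution
summarize y
local sd = r(sd)

* Plotting positions with a = 0.5, attached to the order statistics of y
sort y
generate double p = (_n - 0.5) / _N

* Indirect density f(Q(p)) for a normal reference with the estimated SD
generate double f_indirect = normalden(invnormal(p)) / `sd'

* The histogram shows two modes; the indirect curve can show only one
twoway (histogram y, width(0.5)) (line f_indirect y)
```

Because the plotting positions increase with rank and normalden(invnormal(p)) rises to a single peak at p = 0.5, the curve has one mode however bimodal the histogram looks.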

Programmers may wish to inspect the code and add code for other distributions. If parameters are not estimated, then naturally their values must be supplied: the order of parameters should seem natural or at least conventional.
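
For readers who want to see the construction described in these remarks, the two curves for a normal reference can be computed by hand. This is a minimal sketch under stated assumptions, not the code used by dpplot: it uses the auto dataset shipped with Stata, the current function names normalden() and invnormal() (the newer names for normden() and invnorm()), and hypothetical variable names p, f_direct and f_indirect. The indirect density is divided by the SD so that both curves are on the scale of the data.

```stata
* Example data; mpg has no missing values, which the use of _N below assumes
sysuse auto, clear

* Estimate the parameters of the normal reference distribution
summarize mpg
local mean = r(mean)
local sd = r(sd)

* Plotting positions p_i = (i - 0.5) / n attached to the order statistics
sort mpg
generate double p = (_n - 0.5) / _N

* Direct density: f(X | parameters) evaluated at the observed values
generate double f_direct = normalden(mpg, `mean', `sd')

* Indirect density: f(Q(p_i)), which for the normal is normalden(invnormal(p)) / SD
generate double f_indirect = normalden(invnormal(p)) / `sd'

* Both curves plotted against the observed variable
twoway line f_direct f_indirect mpg
```

Tied values of mpg receive distinct plotting positions here, so they appear as short vertical segments in the indirect curve.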

Options

a() specifies a family of plotting positions, as explained above. The default is 0.5. Choice of a is rarely material unless the sample size is very small, and then the exercise is moot whatever is done.
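
How little the choice of a usually matters can be checked directly from the formula (i - a) / (n - 2a + 1). The sketch below is only an illustration: the values a = 0.375 and a = 0, the auto dataset and the variable names are choices made for this example, not anything dpplot requires.

```stata
* Example data with n = 74 and no missing values of mpg
sysuse auto, clear
sort mpg

* a = 0.5 gives (i - 0.5) / n, the default
generate double p_half = (_n - 0.5) / (_N - 2 * 0.5 + 1)

* a = 0.375 gives (i - 0.375) / (n + 0.25)
generate double p_a375 = (_n - 0.375) / (_N - 2 * 0.375 + 1)

* a = 0 gives i / (n + 1)
generate double p_zero = (_n - 0) / (_N - 2 * 0 + 1)

* Largest absolute differences from the default rule
generate double d_a375 = abs(p_half - p_a375)
generate double d_zero = abs(p_half - p_zero)
summarize d_a375 d_zero
```

The differences are largest at the smallest and largest ranks and shrink as the sample size grows.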

dist() specifies a distribution to act as a reference. The distributions implemented include beta, exponential, gamma, Gumbel, lognormal, Weibull and normal, the last being the default. Gaussian is a synonym for normal.

param() specifies parameter values which give a reference distribution.

With dist(normal) two parameters may be specified. The first is the mean and the second is the standard deviation.

With dist(Weibull) two parameters may be specified. The first is a scale parameter b and the second a shape parameter c. (The density function for a variable x is thus (c / b) (x / b)^(c - 1) exp(-(x / b)^c).)

With dist(lognormal) two parameters may be specified. The first is the mean of logged values and the second is the standard deviation of logged values.

With dist(gumbel) two parameters must be specified. The first is a scale parameter alpha and the second is a location parameter mu. (The density function for a variable x is thus (1 / alpha) * exp[-(x - mu) / alpha] * exp[-exp(-(x - mu) / alpha)].) gumbelfit is one program to estimate parameters.

With dist(gamma) two parameters must be specified. The first is a shape parameter alpha and the second is a scale parameter beta. (The density function for a variable x is thus [1 / (beta^alpha * Gamma(alpha))] x^(alpha - 1) exp(-x / beta), where Gamma() is the gamma function.) gammafit is one program to estimate parameters.

With dist(exponential) one parameter may be specified, namely the mean.

With dist(beta) two parameters must be specified, shape parameters alpha and beta. (The density function for a variable x is thus [1 / Beta(alpha, beta)] x^(alpha - 1) (1 - x)^(beta - 1), where Beta() is the beta function.) betafit is one program to estimate parameters.
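
The parameter order given above can be illustrated with values computed from the data or chosen by hand. The following is a sketch only, assuming dpplot is installed and using the auto dataset shipped with Stata. The normal parameters 20 and 5 are arbitrary, and lnprice is a hypothetical helper variable introduced for this example.

```stata
sysuse auto, clear

* Normal reference with a mean and SD chosen by hand (mean first, then SD)
dpplot mpg, dist(normal) param(20 5)

* Exponential reference: the single parameter is the mean
summarize price, meanonly
dpplot price, dist(exponential) param(`r(mean)')

* Lognormal reference: mean and SD of the logged values, in that order
generate double lnprice = ln(price)
summarize lnprice
dpplot price, dist(lognormal) param(`r(mean)' `r(sd)')
```

Supplying parameters explicitly is mainly useful when the reference values come from theory or from another sample rather than from the data being plotted.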

generate() specifies two new variable names to hold the results of densities estimated from the data directly (as f() given parameters) and indirectly (as f(Q(P)) given parameters).
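
Saved densities can be used for checks beyond the graph itself. The sketch below assumes dpplot is installed and that the first new variable holds the direct density and the second the indirect density, as the description above suggests. The names f_direct, f_indirect and gap are hypothetical.

```stata
sysuse auto, clear

* Keep both density estimates as new variables
dpplot mpg, generate(f_direct f_indirect)

* Pointwise gap between the indirect and direct densities
generate double gap = f_indirect - f_direct
summarize gap

* Inspect where the two curves disagree most
gsort -gap
list mpg f_direct f_indirect gap in 1/5
```

Large gaps concentrated in one part of the range point to the kind of mismatch (skewness, tail weight, ties) discussed in the remarks.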

line(line_options) are options of twoway mspline and twoway line, which may be used to control the rendition of the density function curve.

graph_options are options of twoway.

plot(plot) provides a way to add other plots to the generated graph; see help plot_option.

Examples

. dpplot mpg

. set obs 1000
. gen rnd = invnorm(uniform())
. dpplot rnd, param(0 1)
. dpplot rnd, param(0 1) plot(histogram rnd, bcolor(none) width(0.2))

. dpplot length, dist(lognormal) gen(density1 density2)

. gammafit length
. dpplot length, dist(gamma) param(`e(alpha)' `e(beta)')

Author

Nicholas J. Cox, University of Durham, U.K.
n.j.cox@durham.ac.uk

Acknowledgements

Tim Sofer found a bug.

References

Jones, M.C. 2004. Hazelton, M.L. (2003), "A graphical tool for assessing normality," The American Statistician 57: 285-288: Comment. The American Statistician 58: 176-177.

Jones, M.C. and F. Daly. 1995. Density probability plots. Communications in Statistics, Simulation and Computation 24: 911-927.

Also see

On-line: help for twoway, diagplots, gumbelfit (if installed), gammafit (if installed), betafit (if installed)