```
-------------------------------------------------------------------------------
help for dpplot
-------------------------------------------------------------------------------

Density probability plots

dpplot varname [if exp] [in range] [ , a(#) dist(name) param(numlist)
generate(newvar1 newvar2) line(line_options) graph_options
plot(plot) ]

Description

dpplot plots density probability plots for varname given a reference
distribution, by default normal (Gaussian).

Remarks

To establish notation, and to fix ideas with a concrete example: consider
an observed variable Y, whose distribution we wish to compare with a
normally distributed variable X. That variable has density function f(X),
distribution function P = F(X) and quantile function X = Q(P).  (The
distribution function and the quantile function are inverses of each
other.) Clearly, this notation is fairly general and also covers other
distributions, at least for continuous variables.

The particular density function f(X | parameters) most pertinent to
comparison with data for Y can be computed given values for its
parameters, either estimates from data on Y, or parameter values chosen
for some other good reason. In the case of a normal distribution, these
parameters would usually be the mean and the standard deviation. Such
density functions are often superimposed on histograms or other graphical
displays.  In Stata, histogram has a normal option which adds the normal
density curve corresponding to the mean and standard deviation of the
data shown.

The density function can also be computed indirectly via the quantile
function as f(Q(P)). For example, if P were 0.5, then f(Q(0.5)) would be
the density at the median. In practice P is calculated as so-called
plotting positions p_i attached to values y_(i) of a sample of Y of size
n which have rank i:  that is, the y_(i) are the order statistics y_(1)
<= ... <= y_(n). One simple rule uses p_i = (i - 0.5) / n.  Most other
rules follow one of a family (i - a) / (n - 2a + 1) indexed by a.

Plotting both f(X | parameters) and f(Q(P = p_i)), calculated using
plotting positions, versus observed Y gives two curves. In our example,
the first is normal by construction and the second would be a good
estimate of a normal density if Y were truly normal with the same
parameters. In terms of Stata functions, the two curves are based on
normden((X - mean) / SD) and normden(invnorm(p_i)). The match or
mismatch between the curves allows graphical assessment of goodness or
badness of fit. What is more, we can use experience from comparing
frequency distributions, as shown on histograms, dot plots or other
similar displays, in comparing or identifying location and scale
differences, skewness, tail weight, tied values, gaps, outliers and so
forth.

Such density probability plots were suggested by Jones and Daly (1995).
See also Jones (2004).  They are best seen as special-purpose plots, like
normal quantile plots and their kin, rather than general-purpose plots,
like histograms or dot plots.

Extending the discussion in Jones and Daly (1995), the advantages (+) and
limitations (-) of these plots include

+1. No choices of binning or origin (cf. histograms, dot plots, etc.)
or of kernel or of degree of smoothing (cf. density estimation) are
required.

+2. Some people find them easier to interpret than quantile-quantile
plots.

+3. They work well for a wide range of sample sizes. At the same
time, as with any other method, a sample of at least moderate size is
preferable (one rule of thumb is >= 25).

+4. If X has bounded support in one or both directions, then this
should be clear on the plot.

-1. Results may be difficult to decipher if observed and reference
distributions differ in modality. For example, if the reference
distribution is unimodal but the observed data hint at bimodality,
f(Q(P)) must still be unimodal even though f(Y) may not be.
Similarly, when the reference distribution is exponential, then
f(Q(P)) must be monotone decreasing whatever the shape of f(Y).

-2. It may be difficult to discern subtle differences in one or both
tails of the observed and reference distributions.

-3. Comparison is of a curve with a curve: some people argue that
graphical references should where possible be linear (and ideally
horizontal). (A linear reference is a clear advantage of quantile
plots.)

-4. There is no simple extension to comparison of two samples with
each other.

Programmers may wish to inspect the code and add code for other
distributions.  If parameters are not estimated, then naturally their
values must be supplied:  the order of parameters should seem natural or
at least conventional.

Options

a() specifies the constant a in the family of plotting positions (i - a) /
(n - 2a + 1), as explained above. The default is 0.5. Choice of a is
rarely material unless the sample size is very small, and then the
exercise is moot whatever is done.

dist() specifies a distribution to act as a reference.  The distributions
implemented include beta, exponential, gamma, Gumbel, lognormal,
Weibull and normal, the last being the default. Gaussian is a synonym
for normal.

param() specifies parameter values which give a reference distribution.

With dist(normal) two parameters may be specified. The first is the
mean and the second is the standard deviation.

With dist(Weibull) two parameters may be specified. The first is a
scale parameter b and the second a shape parameter c.  (The density
function for a variable x is thus (c/b) (x/b)^(c - 1) exp(-(x/b)^c).)

With dist(lognormal) two parameters may be specified. The first is
the mean of logged values and the second is the standard deviation of
logged values.

With dist(gumbel) two parameters must be specified. The first is a
scale parameter alpha and the second is a location parameter mu.
(The density function for a variable x is thus (1 / alpha) * exp[-(x
- mu) / alpha] * exp[-exp(-(x - mu) / alpha)].) gumbelfit is one
program to estimate parameters.

With dist(gamma) two parameters must be specified. The first is a
shape parameter alpha and the second is a scale parameter beta.  (The
density function for a variable x is thus [1 / (beta^alpha *
Gamma(alpha))] x^(alpha - 1) exp(-x / beta), where Gamma() is the
gamma function.) gammafit is one program to estimate parameters.

With dist(exponential) one parameter may be specified, namely the
mean.

With dist(beta) two parameters must be specified, shape parameters
alpha and beta.  (The density function for a variable x is thus [1 /
Beta(alpha, beta)] x^(alpha - 1) (1 - x)^(beta - 1), where Beta() is
the beta function.) betafit is one program to estimate parameters.

generate() specifies two new variable names to hold the densities
computed directly (as f() given parameters) and indirectly (as f(Q(P))
given parameters).

line(line_options) are options of twoway mspline and twoway line, which
may be used to control the rendition of the density function curve.

graph_options are options of twoway.

plot(plot) provides a way to add other plots to the generated graph; see
help plot_option.

Examples

. dpplot mpg

. set obs 1000
. gen rnd = invnorm(uniform())
. dpplot rnd, param(0 1)
. dpplot rnd, param(0 1) plot(histogram rnd, bcolor(none) width(0.2))

. dpplot length, dist(lognormal) gen(density1 density2)

. gammafit length
. dpplot length, dist(gamma) param(`e(alpha)' `e(beta)')

Author

Nicholas J. Cox, University of Durham, U.K.
n.j.cox@durham.ac.uk

Acknowledgements

Tim Sofer found a bug.

References

Jones, M.C. 2004. Hazelton, M.L. (2003), "A graphical tool for assessing
normality," The American Statistician 57: 285-288: Comment. The
American Statistician 58: 176-177.

Jones, M.C. and F. Daly. 1995. Density probability plots.  Communications
in Statistics, Simulation and Computation 24: 911-927.

Also see

On-line:  help for twoway, diagplots, gumbelfit (if installed), gammafit
(if installed), betafit (if installed)

```
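
The construction described in the Remarks (plotting positions p_i, then the reference density evaluated both at the observed values and at the reference quantiles Q(p_i)) can be sketched outside Stata for the normal case. This is a minimal illustration that uses only the Python standard library. It is not dpplot's own code, and the function names, the made-up sample, and the choice to estimate the mean and SD from the data are assumptions of the sketch.

```python
from statistics import NormalDist, mean, stdev


def plotting_positions(n, a=0.5):
    """Return p_i = (i - a) / (n - 2a + 1) for ranks i = 1, ..., n.

    Keep a < 1 so that every p_i lies strictly between 0 and 1.
    """
    return [(i - a) / (n - 2 * a + 1) for i in range(1, n + 1)]


def density_probability_curves(values, a=0.5):
    """Return (sorted values, direct densities, indirect densities).

    Assumption of this sketch: the reference is normal, with mean and SD
    estimated from the data rather than supplied by the user.
    """
    y = sorted(values)
    reference = NormalDist(mean(y), stdev(y))
    # Direct curve: reference density evaluated at each observed value.
    direct = [reference.pdf(v) for v in y]
    # Indirect curve: reference density evaluated at the reference
    # quantile Q(p_i) attached to the value of rank i.
    positions = plotting_positions(len(y), a)
    indirect = [reference.pdf(reference.inv_cdf(p)) for p in positions]
    return y, direct, indirect


# Hypothetical data, for illustration only.
sample = [12, 14, 15, 17, 18, 18, 19, 21, 22, 25, 26, 30]
y, direct, indirect = density_probability_curves(sample)
for value, d, q in zip(y, direct, indirect):
    print(f"{value:5.1f}  {d:.4f}  {q:.4f}")
```

Plotting both density columns against the sorted values would give the two curves to compare. Because the indirect curve depends on the data only through the ranks and the parameters, it is symmetric and unimodal for a normal reference whatever the shape of the sample.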