```
-------------------------------------------------------------------------------
help for dpplot7
-------------------------------------------------------------------------------

Density probability plots

dpplot7 varname [if exp] [in range] [ , a(#) dist(name) param(numlist)
graph_options ]

Description

dpplot7 plots density probability plots for varname given a reference
distribution, by default normal (Gaussian).

Note: dpplot7 is the original version, written for Stata 7, of dpplot. Users of
Stata 8 or later should switch to dpplot.

Remarks

To establish notation, and to fix ideas with a concrete example: consider an
observed variable Y, whose distribution we wish to compare with a normally
distributed variable X. That variable has density function f(X), distribution
function P = F(X) and quantile function X = Q(P). (The distribution function
and the quantile function are inverses of each other.) Clearly, this notation
is fairly general and also covers other distributions, at least for continuous
variables.

The particular density function f(X | parameters) most pertinent to comparison
with data for Y can be computed given values for its parameters, either
estimates from data on Y, or parameter values chosen for some other good
reason. In the case of a normal distribution, these parameters would usually be
the mean and the standard deviation. Such density functions are often
superimposed on histograms or other graphical displays. In Stata, graph with
the histogram option also accepts a normal option, which adds the normal
density curve corresponding to the mean and standard deviation of the data
shown.

The density function can also be computed indirectly via the quantile function
as f(Q(P)). For example, if P were 0.5, then f(Q(0.5)) would be the density at
the median. In practice P is calculated as so-called plotting positions p_i
attached to values y_(i) of a sample of Y of size n which have rank i: that is,
the y_(i) are the order statistics y_(1) <= ... <= y_(n). One simple rule uses
p_i = (i - 0.5) / n. It is the member a = 0.5 of the family
p_i = (i - a) / (n - 2a + 1), indexed by a, to which most other rules belong.
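
As a minimal sketch (not taken from the dpplot7 code), plotting positions for
a variable with no missing values could be computed by hand as follows. The
variable names y and p are illustrative only.

. clear
. set obs 100
. gen y = invnorm(uniform())
. sort y
. gen p = (_n - 0.5) / _N

After sort, _n is the rank i and _N is the sample size n. Replacing the last
command with gen p = (_n - 0.375) / (_N + 0.25) would give the member
a = 0.375 of the same family.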

Plotting both f(X | parameters) and f(Q(P = p_i)), calculated using plotting
positions, versus observed Y gives two curves. In our example, the first is
normal by construction and the second would be a good estimate of a normal
density if Y were truly normal with the same parameters. In terms of Stata
functions, the two curves are based on normden((X - mean) / SD) and
normden(invnorm(p_i)). The match or mismatch between the curves allows
graphical assessment of goodness or badness of fit. What is more, we can use
experience from comparing frequency distributions, as shown on histograms, dot
plots or other similar displays, in comparing or identifying location and scale
differences, skewness, tail weight, tied values, gaps, outliers and so forth.
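
A minimal sketch of both curves for simulated normal data, in Stata 7 syntax,
follows. It is not the code used by dpplot7: the names y, fx, fq, p, mean_y
and sd_y are illustrative, and the division by sd_y, which puts both
densities on the scale of y, is added here.

. clear
. set obs 1000
. gen y = invnorm(uniform())
. summarize y
. scalar mean_y = r(mean)
. scalar sd_y = r(sd)
. gen fx = normden((y - mean_y) / sd_y) / sd_y
. sort y
. gen p = (_n - 0.5) / _N
. gen fq = normden(invnorm(p)) / sd_y
. graph fq fx y, symbol(oi) connect(.l) sort

If y is close to normal, the points for fq should lie close to the curve for
fx.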

Such density probability plots were suggested by Jones and Daly (1995).  They
are best seen as special-purpose plots, like normal quantile plots and their
kin, rather than general-purpose plots, like histograms or dot plots.

Extending the discussion in Jones and Daly (1995), the advantages (+) and
limitations (-) of these plots include

+1. No choices of binning or origin (cf. histograms, dot plots, etc.) or of
kernel or of degree of smoothing (cf. density estimation) are required.

+2. Some people find them easier to interpret than quantile-quantile plots.

+3. They work well for a wide range of sample sizes. At the same time, as
with any other method, a sample of at least moderate size is preferable
(one rule of thumb is >= 25).

+4. If X has bounded support in one or both directions, then this should be
clear on the plot.

-1. Results may be difficult to decipher if observed and reference
distributions differ in modality. For example, if the reference
distribution is unimodal but the observed data hint at bimodality, then
f(Q(P)) must still be unimodal even though f(Y) may not be.
Similarly, when the reference distribution is exponential, then f(Q(P))
must be monotone decreasing whatever the shape of f(Y).

-2. It may be difficult to discern subtle differences in one or both tails
of the observed and reference distributions.

-3. Comparison is of a curve with a curve: some people argue that graphical
references should where possible be linear (and ideally horizontal). (A
linear reference is a clear advantage of quantile plots.)

-4. There is no simple extension to comparison of two samples with each
other.

Programmers may wish to inspect the code and add code for other distributions.
If parameters are not estimated, then naturally their values must be supplied:
the order of parameters should seem natural or at least conventional.

Options

graph_options are options of graph, twoway.  The defaults include gap(4)
symbol(oi) connect(.s) l1title("Probability density") xla yla.

a() specifies the constant a that selects one member of the family of
plotting positions (i - a) / (n - 2a + 1), as explained above. The default
is 0.5. Choice of a is rarely material unless the sample size is very
small, and then the exercise is moot whatever is done.

dist() specifies a distribution to act as a reference. In this preliminary
version, the distributions allowed are exponential and normal, the latter
being the default. Gaussian is a synonym for normal. Abbreviations down to
at least three letters (e.g. exp, nor) are allowed.

param() specifies parameter values that define the reference distribution;
any values specified override estimates from the data.

With dist(normal) up to two parameters may be specified. The first is the
mean and the second is the standard deviation.

With dist(exponential) one parameter may be specified, namely the mean.
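
As an illustration, assuming simulated data and the illustrative variable
name ye, an exponential sample with mean 2 could be compared with an
exponential reference distribution with that mean:

. clear
. set obs 1000
. gen ye = -2 * ln(1 - uniform())
. dpplot7 ye, dist(exp) param(2)

For an exponential distribution with mean m, Q(P) = -m ln(1 - P) and so
f(Q(P)) = (1 - P) / m, which decreases steadily as P increases.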

Examples

. dpplot7 mpg

. set obs 1000
. gen rnd = invnorm(uniform())
. dpplot7 rnd, param(0 1)
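
The following further calls are sketches of how the options documented above
might be used; the values 0.375, 20 and 5 are arbitrary illustrations.

. dpplot7 mpg, a(0.375)
. dpplot7 mpg, dist(normal) param(20 5)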

Author

Nicholas J. Cox, University of Durham, U.K.
n.j.cox@durham.ac.uk

References

Jones, M.C. and F. Daly. 1995. Density probability plots.  Communications in
Statistics, Simulation and Computation 24: 911-927.

Also see

On-line:  help for graph, diagplots

```