```
-------------------------------------------------------------------------------
help for pdplot
-------------------------------------------------------------------------------

Pareto dot plot

pdplot catvar [if] [in] [weight] [ , aiopts(options) dotsonly
horizontal level(#) nreps(#) options ]

fweights are allowed; see help weights.

Description

pdplot produces a Pareto dot plot as proposed by Wilkinson (2006).  The
frequencies of the categories of catvar are shown in order by a series of
dots against a magnitude scale. As backdrop, corresponding acceptance
intervals are shown by bars.

The command is more flexible than this description of default behaviour
implies. The intervals can be suppressed and the dot plot can be recast()
to another kind of twoway plot.

Remarks

Wilkinson (2006) briefly reviews Pareto charts, which commonly combine two
displays in one. Frequencies in various categories are shown by a series
of bars arranged in frequency order, from most common downwards.  On that
is often superimposed a rising curve showing cumulative frequency.
Frequency and cumulative frequency may or may not have consistent scales.
Examples from quality management studies often show categories of
accidents, complaints, defects, failures, rejects, returns, or other such
unwelcome phenomena. Wilkinson gives several cogent criticisms of this
design and suggests an alternative: show frequencies in order, but by a
dot plot, and add as reference a series of acceptance intervals.

The acceptance intervals are calculated by simulation. Imagine as
benchmark a population in which k categories are equally probable, and
imagine taking samples of size n. Here k and n are the same as those in
the data under consideration. Just by chance the observed frequencies of
the k categories will typically differ. For each sample we can label the
frequencies f_(1) >= f_(2) >= ... >= f_(k-1) >= f_(k): thus f_(1) is the
frequency of the most abundant category, and so forth. Across several
samples we can get order statistics for each f_(j) and use those to
calculate intervals with desired coverage.

The acceptance intervals should not be overinterpreted, for various
reasons. First, the reference distribution is just an agnostic
guess assuming that we know just the numbers of values and categories and
that we have no reason to suppose that categories differ in probability.
More commonly, we would not really expect that the categories are equal
in probability; it is just that we would rarely know how to make our
expectations precise. Second, although increasing the number of
repetitions will stabilise results, some variability will always remain in
intervals produced by simulation. Third, there are various slightly
different recipes for producing percent points from order statistics and
only one is wired in here.

Vilfredo Pareto (1848-1923) was an Italian sociologist, economist and
philosopher, perhaps best remembered for his work on income distributions
and what is now called Pareto efficiency.  See
http://en.wikipedia.org/wiki/Vilfredo_Pareto.  Contrary to many
statements, there appears to be no evidence that he used what is now
known as the Pareto chart, which seems to have emerged in quality
management after 1951. See Wilkinson (2006) for more on the latter point.

Pareto charts are commonly shown as vertical bar charts with the awkward
consequence that long text labels are aligned vertically or obliquely,
and so made difficult to read. The horizontal option allows you to
override the default.

Options

aiopts(options) are options tuning the appearance of the acceptance
intervals. Note that the default options are barw(0.2) bcolor(none).
The bars are laid down before the dots, so more colourful bars are
possible. For comparison, note that the bars are centred at ranks 1,
2, etc., so that the default bar width is 20% of the spacing between
adjacent bars.

dotsonly suppresses the simulation and the addition of acceptance
intervals. In that case any aiopts(), level() or nreps() has no
effect.

horizontal specifies that the display should be aligned horizontally. The
default is vertical. Note that horizontal with recast() works but is
not useful.

level(#) specifies a coverage level as a percent for the acceptance
intervals. The default is given by c(level), which in turn defaults
to 95. See help on level if desired.

nreps(#) specifies the number of repetitions of random drawing of a
sample of size n from a population with k equally frequent
categories. The default is 10000. For indicative, exploratory work,
fewer repetitions may be adequate. Note that large numbers of
repetitions, especially with large n, have implications for time and
memory.  Positively, pdplot is most needed and most useful when the
sample size is small. For reproducible results, set seed
beforehand.

options are other twoway options tuning the appearance of the graph. For
example, note that a dot display is not compulsory and can be
replaced using the recast() option.

Examples

. pdplot category
. pdplot category, recast(bar) barw(0.1)
. pdplot category, dotsonly

Acknowledgements

Author

Nicholas J. Cox, Durham University
n.j.cox@durham.ac.uk

References

Wilkinson, Leland. 2006. Revising the Pareto chart.  American
Statistician 60(4): 332-334.

Also see

On-line: help for graph bar, graph dot, catplot (if installed)

```
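
The simulation described under Remarks can be sketched outside Stata. The following is a minimal illustration in Python, not the code used by pdplot: the function name acceptance_intervals, its arguments and the nearest-rank rule for percent points are all assumptions made for this sketch, and pdplot's own recipe for percent points may differ. It draws nreps samples of size n from k equally probable categories, sorts each sample's frequencies from largest to smallest, and then takes lower and upper percent points of each ordered frequency f_(j) across samples.

```python
import math
import random


def acceptance_intervals(n, k, level=95.0, nreps=2000, seed=12345):
    """Return [(lower, upper), ...] for f_(1), ..., f_(k).

    Hypothetical helper for illustration only; the percent-point rule
    (nearest rank on the sorted simulated values) is an assumption.
    """
    rng = random.Random(seed)  # fixed seed so results are reproducible
    ordered = []  # one descending-sorted frequency list per repetition
    for _ in range(nreps):
        counts = [0] * k
        for _ in range(n):
            counts[rng.randrange(k)] += 1  # each category equally probable
        ordered.append(sorted(counts, reverse=True))

    tail = (100.0 - level) / 200.0  # e.g. 0.025 in each tail for level 95
    lo_idx = math.floor(tail * (nreps - 1))
    hi_idx = math.ceil((1.0 - tail) * (nreps - 1))

    intervals = []
    for j in range(k):
        # order statistics of the j-th largest frequency across repetitions
        fj = sorted(rep[j] for rep in ordered)
        intervals.append((fj[lo_idx], fj[hi_idx]))
    return intervals


if __name__ == "__main__":
    # 30 observations in 5 categories, as a small worked case
    for rank, (lower, upper) in enumerate(acceptance_intervals(30, 5), 1):
        print(rank, lower, upper)
```

Changing the seed or lowering nreps shifts the interval endpoints slightly, which is the simulation variability noted under Remarks.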