Equal probability histogram
eqprhistogram varname [weight] [if exp] [in range] [ , bin(#) mean plot(plot) graph_options ]
Description
eqprhistogram shows a histogram of the distribution of varname constructed so that each bar represents the same fraction of the data.
fweights and aweights may be specified.
Remarks
As an example, suppose we calculate the minimum, the lower quartile, the median, the upper quartile and the maximum of a variable. These 5 quantiles allow us to draw a histogram with 4 bars, each representing 1/4 of the total (possibly weighted) frequency, or of the total probability 1. The height of each bar, which represents the average probability density in its interval, should be 1 / (4 * (higher quantile - lower quantile)). More generally, k bars each representing a fraction or probability 1/k may be drawn after calculating k + 1 quantiles equally spaced in terms of cumulative probability. In practice, the lowest and highest of these are obtained from summarize and the others from _pctile.
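The quartile case can be worked by hand as a rough sketch. This is not the code inside eqprhistogram; it only assumes the auto data and uses summarize and _pctile as described above. The local macro names (q0 to q4, w, h, area) are illustrative choices.

```stata
* Sketch: bar heights for a 4-bar equal probability histogram of price
sysuse auto, clear

* lowest and highest quantiles (minimum and maximum) from summarize
summarize price, meanonly
local q0 = r(min)
local q4 = r(max)

* the three interior quartiles from _pctile
_pctile price, nquantiles(4)
local q1 = r(r1)
local q2 = r(r2)
local q3 = r(r3)

* each bar has area 1/4, so height = (1/4) / width
local area 0
forvalues j = 1/4 {
    local i = `j' - 1
    local w = `q`j'' - `q`i''
    local h = (1/4) / `w'
    local area = `area' + `h' * `w'
    display "bar `j': [`q`i'', `q`j''], height = " `h'
}

* the bar areas should sum to the total probability
display "total area = " `area'
```

Narrow intervals give tall bars and wide intervals give short bars, while every bar keeps the same area.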
eqprhistogram refuses to draw graphs when the quantiles calculated are not distinct, because tied quantiles would give a bar of zero width. Ties are likely with categorical, discrete or highly rounded data, especially as the number of quantiles approaches the number of distinct values. It is recommended either to ask for fewer bins or to reconsider the appropriateness of the request.
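To see in advance whether a request is likely to be refused, the interior quantiles can be inspected directly. The following is a hypothetical check, not the test that eqprhistogram itself performs; it assumes the auto data and the variable rep78, which takes only a few distinct values, and the macro name ties is illustrative.

```stata
* Sketch: look for tied interior quantiles before asking for 8 bins
sysuse auto, clear
_pctile rep78, nquantiles(8)

* compare each of the 7 interior quantiles with its predecessor
local ties 0
forvalues j = 2/7 {
    local i = `j' - 1
    if r(r`j') == r(r`i') {
        local ties 1
    }
}

display cond(`ties', "tied quantiles: ask for fewer bins", "interior quantiles are distinct")
```

A fuller check would also compare the first and last interior quantiles with the minimum and maximum from summarize.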
Equal probability histograms have some analytical value. Perhaps their greatest merit is pedagogic: they illustrate the principle behind histograms, that area represents probability, and they show graphically how quantiles such as quartiles, octiles or deciles relate to the histogram.
Note that eqprhistogram is not implemented using histogram, but directly.
This kind of graph has been discussed by, for example, Breiman (1973, pp. 208-9) and Scott (1992, pp. 69-70). Breiman points out that the associated error will be approximately a constant multiple of the bar heights, so long as the bin frequencies are not too small. Scott points out that, in terms of mean integrated squared error, the equal probability histogram is a poor estimator of the underlying probability density function. Simonoff (1996, p. 34) gives references for related work from 1969. (Please email the author with details of any earlier or fuller discussions.)
Options
bin(#) specifies the number of bins. The default is 8. The number of bins may not exceed 20 in Stata 8.0 or 1000 as of Stata 8.1.
mean adds a dashed line indicating the mean of the data.
plot(plot) provides a way to add other plots to the generated graph; see help plot_option.
graph_options refers to options of twoway bar.
Examples
. sysuse auto
. eqprhistogram price
. eqprhistogram price, bin(4)
. eqprhistogram price, bin(10) plot(kdensity price)
Author
Nicholas J. Cox, University of Durham
n.j.cox@durham.ac.uk
Acknowledgements
Marcello Pagano suggested the problem. Vince Wiggins provided very helpful comments, particularly in pointing to the undocumented option bartype(spanning).
References
Breiman, L. 1973. Statistics: with a view towards applications. Boston: Houghton Mifflin.
Scott, D.W. 1992. Multivariate density estimation: theory, practice, and visualization. New York: John Wiley.
Simonoff, J.S. 1996. Smoothing methods in statistics. New York: Springer.
Also see
On-line: help for histogram