One-, two- and three-way bar charts for tables
tabplot rowvar colvar [if exp] [in range] [weight] [ , [ fraction | fraction(varlist) | percent | percent(varlist) ] missing yasis xasis height(#) showval[(specification)] minimum(#) maximum(#) separate(sepspec) bar1(rbar_options) ... bar20(rbar_options) barall(rbar_options) graph_options [ plot(plot) | addplot(plot) ] ]
tabplot varname [if exp] [in range] [weight] [ , [ fraction | fraction(varlist) | percent | percent(varlist) ] missing yasis xasis height(#) showval[(specification)] minimum(#) maximum(#) separate(sepspec) bar1(rbar_options) ... bar20(rbar_options) barall(rbar_options) graph_options [ plot(plot) | addplot(plot) ] ]
fweights, aweights and iweights may be specified.
Description
tabplot plots a table of numerical values (e.g. frequencies, fractions or percents) in graphical form as a bar chart. It is mainly intended for representing contingency tables for one, two or three categorical variables. It also has uses for producing multiple histograms and graphs for general one-, two- or three-way tables.
tabplot rowvar colvar follows the standard tabular alignment: the categories of rowvar define rows from top (low values) to bottom (high values) and the categories of colvar define columns from left (low values) to right (high values). The frequency (fraction, percent) of each combination of row and column is shown as a bar, with default alignment vertical and default width 0.5. Use the barwidth() option to vary width, but note that all bars will have the same width. By default both variables are mapped on the fly in sort order to successive integers from 1 up, but original values or value labels are used as value labels: this may be varied by use of the yasis or xasis options.
Alternatively, tabplot varname creates a bar chart which by default displays one set of vertical bars; with the horizontal option it displays one set of horizontal bars. The categories of varname thus define either columns from left (low values) to right (high values) or rows from top (low values) to bottom (high values). The frequency (fraction, percent) of each column or row is shown as a bar.
Remarks
The display is deliberately minimal. No numeric scales are shown for reading off numeric values, although optionally numeric values may be shown below bars by use of the showval option. Above all, there is no facility for any kind of three-dimensional display or effect. The maximum value (or more generally biggest value) shown is indicated by use of note(), unless showval or showval() is specified.
In contrast to a table, in which it is easier to compare values down columns, it is usually easier to compare values across rows whenever bars are vertical. A simple alternative is to use the horizontal option, in which case it is usually easier to compare down columns. Some experimentation with both forms and with percent(rowvar) or percent(colvar) will often be helpful.
tabplot rowvar colvar, by() is the way to plot three-way tables. The variable specified in by() is used to produce a set of graphs in several panels. Similarly, tabplot varname, by() is another way to plot two-way tables.
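As a minimal sketch of the by() route to a three-way table, the commands below use the nlsw88 dataset shipped with Stata. The choice of dataset and of race, collgrad and married as the three variables is an assumption made purely for illustration and is not taken from this help. tabplot is assumed to be installed already.

```stata
* Illustrative only: dataset and variables are assumptions, not from this help
sysuse nlsw88, clear

* Three-way table: race (rows) by collgrad (columns), one panel per married
tabplot race collgrad, by(married)

* Two-way table shown as one-way bar charts in separate panels
tabplot race, by(married)
```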
tabplot with the xasis option may be useful for stacking histograms vertically. Less commonly, with the yasis and horizontal options it may be useful for stacking them horizontally. A typical protocol would be, for mpg shown in bins of width 2.5 mpg,
. gen midpoint = round(mpg, 2.5)
. _crcslbl midpoint mpg
. tabplot foreign midpoint, xasis barw(2.5) bstyle(histogram) percent(foreign)
In general, specify a variable containing equally-spaced midpoints and assign to it an appropriate variable label. tabplot will do the rest. Omit the percent() option for display of frequencies.
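The same protocol can be sketched with the documented label variable command in place of _crcslbl. This is offered as one possible way of copying the variable label, not as the author's method; percent() is omitted here so that frequencies are shown.

```stata
sysuse auto, clear

* Bin mpg into classes of width 2.5, recorded by their midpoints
gen midpoint = round(mpg, 2.5)

* Copy the variable label of mpg using an extended macro function
label variable midpoint "`: variable label mpg'"

* One histogram of frequencies for each category of foreign
tabplot foreign midpoint, xasis barw(2.5) bstyle(histogram)
```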
A recipe for subverting tabplot to plot any variable that takes on a single value for each cross-combination of categories is illustrated in the examples below. The key is to select precisely one observation for each cross-combination and to specify that variable as (most generally) an iweight.
Furthermore, using an iweight is the only possible method whenever a variable has at least some negative values. In that case,
1. Consider changing the maximum height through height() to avoid overlap of bars variously representing positive and negative values. By default tabplot chooses the scale to accommodate the longest bar to be shown, but it contains no special intelligence otherwise to avoid overlap of bars in the same column or row.
2. If also using showval or showval(), consider changing the offset() and using a transparent bfcolor().
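The two points above might be put together as follows. This is only a sketch: the variable names cellmean, dev and pick are hypothetical, the quantity plotted (deviation of each cell mean of mpg from the overall mean) is chosen just because it takes both signs, and the particular values of height() and offset() are guesses that will usually need tuning.

```stata
sysuse auto, clear

* Cell means of mpg and their deviations from the overall mean
egen cellmean = mean(mpg), by(foreign rep78)
summarize mpg, meanonly
gen dev = cellmean - r(mean)

* Select precisely one observation for each cross-combination
egen pick = tag(foreign rep78)

* Plot deviations as iweights; shrink bars and leave them unfilled
* so that positive and negative bars and shown values stay legible
tabplot foreign rep78 if pick [iw=dev], height(0.4) bfcolor(none) ///
    showval(format(%2.1f) offset(0.2)) subtitle(deviation from mean mpg)
```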
Bar charts presented as one row or one column of bars go back at least as far as Playfair (1786). See (e.g.) Playfair (2005, p.25) or Wainer (2005, p.45; 2009, p.174).
Bar charts presented in table form with two or more rows and two or more columns are less common.
They have been used in one form of pollen diagram. Sears (1933, 1935) gave some early examples.
Brinton (1939), Neurath (1939), Rogers (1961), Lockwood (1969), Doran and Hodson (1975), Bertin (1981, 1983), Lebart, Morineau and Warwick (1984), Anderson and May (1991), Chapman and Wykes (1996), de Falguerolles et al. (1997), Chauchat and Risson (1998), MacKay (2003, 2008), Wilkinson (2005), Unwin, Theus and Hofmann (2006), Hahsler, Hornik and Buchta (2008), Hofmann (2008), Theus and Urbanek (2009) and Few (2009, 2012) also give a variety of examples.
Options
fraction indicates that all frequencies should be shown as fractions (with sum 1) of the total frequency of all values being represented in the graph.
fraction(varlist) indicates that all frequencies should be shown as fractions (with sum 1) of the total frequency for each distinct category defined by the combinations of varlist. Usually, varlist will be either rowvar or colvar.
percent indicates that all frequencies should be shown as percents (with sum 100) of the total frequency of all values being represented in the graph.
percent(varlist) indicates that all frequencies should be shown as percents (with sum 100) of the total frequency for each distinct category defined by the combinations of varlist. Usually, varlist will be either rowvar or colvar.
Only one of these fraction[()] and percent[()] options may be specified.
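To make the definition of percent(varlist) concrete, the sketch below computes by hand the percents that percent(foreign) is described as showing, namely percents summing to 100 within each category of foreign. The variable names cellfreq, rowpct and pick are hypothetical, and observations with missing rep78 are dropped first on the assumption that the missing option is not in use.

```stata
sysuse auto, clear
keep if !missing(rep78)

* Frequency of each combination of foreign and rep78
bysort foreign rep78 : gen cellfreq = _N

* Percent of the total frequency within each category of foreign
bysort foreign : gen rowpct = 100 * cellfreq / _N

* List one observation per combination for comparison with the graph
egen pick = tag(foreign rep78)
list foreign rep78 cellfreq rowpct if pick, sepby(foreign)

tabplot foreign rep78, percent(foreign) showval
```

Replacing percent(foreign) by percent alone would instead divide each frequency by the total frequency of all values shown in the graph.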
missing specifies that any missing values of any of the variables specified should also be included within their own categories.
yasis and xasis specify respectively that the y (row) variable and the x (column) variable are to be treated literally (that is, numerically). Most commonly, each option will be specified if the variable in question is a measured scale or a graded variable with gaps. If values 1 to 5 are labelled A to E, but no value of 4 (D) is present in the data, yasis or xasis prevents a mapping to 1 (A) ... 4 (E).
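A small sketch of the difference, using the auto data with one category of rep78 deliberately excluded to create a gap (an artificial device for illustration only):

```stata
sysuse auto, clear

* Default: the remaining values 1, 2, 3, 5 are mapped to 1, 2, 3, 4
tabplot foreign rep78 if rep78 != 4

* xasis: values are taken literally, so position 4 should be left empty
tabplot foreign rep78 if rep78 != 4, xasis
```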
height(#) controls the amount of graph space taken up by bars. The default is 0.8. Note that the height may need to be much smaller or much larger with yasis or xasis, given that the latter take values literally.
showval specifies that numeric values are to be shown beneath (or, if horizontal is specified, to the left of) bars.
showval may also be specified with a variable name and/or options. If options alone are specified, no comma should be given. In particular,
showval(varname) would specify that the values to be shown are those of varname. For example, the values of some kind of residuals might be shown alongside frequency bars.
showval(offset(#)) specifies an offset between the base of the bar and the position of the numeric value. Default is 0.1 with two variables or 0.02 with one variable. Tweak this if the spacing is too large or too small.
showval(format(format)) specifies a format with which to show values. Specifying a format will often be advisable with non-integers. Example: showval(format(%2.1f)) specifies rounding to 1 decimal place. Note that with a specified variable the format defaults to the format of that variable; with percent options the format defaults to %2.1f (1 decimal place); with fraction options the format defaults to %4.3f (3 decimal places).
showval(varname, format(%2.1f)) is an example of varname specified with options.
Otherwise the options of showval() can be options of scatter, most usually marker label options.
minimum() suppresses plotting of bars with values less than the minimum specified, in effect setting them to zero.
maximum() truncates bars with values more than the maximum specified to show that maximum.
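As an illustration with the auto data, where the largest cell frequency is 27 (domestic cars with rep78 equal to 3) and two cells have frequency 2, the thresholds below are arbitrary choices made only to show the syntax:

```stata
sysuse auto, clear

* Suppress bars for cells with frequency less than 3
tabplot foreign rep78, minimum(3)

* Truncate any bar for a frequency above 20 so that it shows 20
tabplot foreign rep78, maximum(20)
```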
separate() specifies that bars associated with different sepspec will be shown differently, most obviously using different colours. sepspec is passed as an argument to the by() option of separate, except that references to @ are first translated to be references to the quantity being plotted.
A call to separate() may be supplemented with calls to options bar1() ... bar20() and/or to barall(). The arguments should be options of twoway rbar.
Options bar1() to bar20() are provided to allow overriding the defaults on up to 20 categories, the first, second, etc., shown. The limit of 20 is plucked out of the air as more than any user should really want. The option barall() is available to override the defaults for all bars. Any of bar1() to bar20() always overrides barall(). Thus if you wanted thicker blwidth() on all bars you could specify barall(blwidth(thick)). If you wanted to highlight the first category only you could specify bar1(blwidth(thick)).
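A sketch of these options used together, with @ standing for the quantity being plotted (here the cell frequency). The threshold of 9 and the colours are arbitrary choices, and it is an assumption that the second category shown corresponds to cells for which the condition is true.

```stata
sysuse auto, clear

* Thick outlines on all bars; cells with frequency of at least 9
* (assumed to be the second category) also get a distinct colour
tabplot foreign rep78, separate(@ >= 9) barall(blwidth(thick)) ///
    bar2(bcolor(red*0.5)) showval
```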
graph_options refers to options of twoway rbar. Among others:
barwidth() specifies the widths of the bars. The default is 0.5. This may need changing, especially with option xasis or yasis.
by() specifies another variable used to subdivide the display into panels.
recast() recasts the graph as another twoway plottype. In practice, recast(rspike) is the main alternative.
plot(plot) provides a way to add other plots to the generated graph. Allowed in Stata 8 only.
addplot(plot) provides a way to add other plots to the generated graph. Allowed in Stata 9 upwards.
With large datasets especially, it is advisable to ensure that any extra plots do not contain information repeated for every observation within each combination of rowvar and colvar. The examples show one technique for avoiding this.
Examples
. sysuse auto, clear
. tabplot for rep78
. tabplot for rep78, showval
. tabplot for rep78, percent(foreign) showval(offset(0.05) format(%2.1f))
. tabplot for rep78, percent(foreign) sep(foreign) bar1(bcolor(red*0.5)) bar2(bcolor(blue*0.5)) showval(offset(0.05) format(%2.1f))
. tabplot rep78 mpg, xasis barw(1) bstyle(histogram)
. egen mean = mean(mpg), by(rep78)
. gen rep78_2 = 6 - rep78 - 0.05
. bysort rep78 : gen byte tag = _n == 1
. tabplot rep78 mpg, xasis barw(1) bstyle(histogram) addplot(scatter rep78_2 mean if tag)
. tabplot rep78
. tabplot rep78, showval
. tabplot rep78, showval horizontal
. egen mean2 = mean(mpg), by(foreign rep78)
. egen tag2 = tag(foreign rep78)
. tabplot foreign rep78 if tag2 [iw=mean2], showval(format(%2.1f)) subtitle(mean miles per gallon)
. su mpg
. tabplot foreign rep78 if tag2 [iw=mean2], showval(format(%2.1f)) separate(@ > 21.29) bar1(bcolor(red)) bar2(bcolor(black))
. webuse rate2, clear
. tabplot rad?, percent showval
. count
. bysort rada radb : gen show = string(_N) + " " + string(_N * 100/85, "%2.1f") + "%"
. tabplot rad?, showval(show)
Author
Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk
Acknowledgments
Bob Fitzgerald, Friedrich Huebler and Martyn Sherriff found typos in this help. Friedrich also pointed to various efficiency issues. Marcello Pagano provided encouragement and found a bug. Vince Wiggins suggested how best to align x-axis labels when bars are horizontal.
References
Anderson, R.M. and May, R.M. 1991. Infectious diseases of humans: dynamics and control. Oxford: Oxford University Press.
Bertin, J. 1981. Graphics and graphic information-processing. Berlin: Walter de Gruyter.
Bertin, J. 1983. Semiology of graphics: Diagrams, networks, maps. Madison: University of Wisconsin Press.
Brinton, W.C. 1939. Graphic presentation. New York: Brinton Associates. http://www.archive.org/stream/graphicpresentat00brinrich
Chapman, M. and Wykes, C. 1996. Plain figures. London: The Stationery Office.
Chauchat, J.-H. and Risson, A. 1998. Bertin's graphics and multidimensional data analysis. In Blasius, J. and Greenacre, M. (Eds) Visualization of categorical data. San Diego, CA: Academic Press, 37-45.
Cox, N.J. 2004. Graphing categorical and compositional data. Stata Journal 4: 190-215.
Cox, N.J. 2008. Spineplots and their kin. Stata Journal 8: 105-121.
de Falguerolles, A., Friedrich, F. and Sawitzki, G. 1997. A tribute to J. Bertin's graphical data analysis. In Bandilla, W. and Faulbaum, F. (Eds) Advances in Statistical Software 6. Stuttgart: Lucius and Lucius, 11-20. http://statlab.uni-hd.de/reports/by.series/beitrag.34.pdf
Doran, J.E. and Hodson, F.R. 1975. Mathematics and computers in archaeology. Edinburgh: Edinburgh University Press. See p.118.
Few, S. 2009. Now you see it: Simple visualization techniques for quantitative analysis. Oakland, CA: Analytics Press.
Few, S. 2012. Show me the numbers: Designing tables and graphs to enlighten. Burlingame, CA: Analytics Press.
Hahsler, M., Hornik, K. and Buchta, C. 2008. Getting things in order: an introduction to the R package seriation. Journal of Statistical Software 25(3). http://www.jstatsoft.org/v25/i03
Hofmann, H. 2008. Mosaic plots and their variants. In Chen, C., Härdle, W. and Unwin, A. (Eds) Handbook of data visualization. Berlin: Springer, 617-642.
Lebart, L., Morineau, A. and Warwick, K.M. 1984. Multivariate descriptive statistical analysis: Correspondence analysis and related techniques for large matrices. New York: John Wiley. See p.50.
Lockwood, A. 1969. Diagrams: A visual survey of graphs, maps, charts and diagrams for the graphic designer. London: Studio Vista. See pp.27, 32, 45, 53, 61, 62.
MacKay, D.J.C. 2003. Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
MacKay, D.J.C. 2008. Sustainable energy - without the hot air. Cambridge: UIT Cambridge.
Neurath, O. 1939. Modern man in the making. London: Secker and Warburg. See p.74.
Playfair, W. 1786. The commercial and political atlas. London: Debrett; Robinson; and Sewell.
Playfair, W. 2005. The commercial and political atlas and Statistical breviary. (eds. Wainer, H. and Spence, I.) Cambridge: Cambridge University Press.
Rogers, A.C. 1961. Graphic charts handbook. Washington, DC: Public Affairs Press.
Sears, P.B. 1933. Climatic change as a factor in forest succession. Journal of Forestry 31: 934-942.
Sears, P.B. 1935. Types of North American pollen profiles. Ecology 16: 488-499.
Theus, M. and Urbanek, S. 2009. Interactive graphics for data analysis: Principles and examples. Boca Raton, FL: CRC Press.
Unwin, A., Theus, M. and Hofmann, H. 2006. Graphics of large datasets: Visualizing a million. New York: Springer.
Wainer, H. 2005. Graphic discovery: A trout in the milk and other visual adventures. Princeton, NJ: Princeton University Press.
Wainer, H. 2009. Picturing the uncertain world: How to understand, communicate, and control uncertainty through graphical display. Princeton, NJ: Princeton University Press.
Wilkinson, L. 2005. The grammar of graphics. New York: Springer.
Also see
On-line: help for twoway rbar, histogram, catplot (if installed), spineplot (if installed)