Spine plots for two-way categorical data
spineplot yvar xvar [weight] [if exp] [in range] [ , bar1(twoway_bar_options) ... bar20(twoway_bar_options) barall(twoway_bar_options) missing percent text(textvar [, marker_label_options]) twoway_options ]
Description
spineplot produces a spine plot for two-way categorical data. The fractional breakdown of the categories of the first-named variable yvar is shown for each category of the second-named variable xvar. Stacked bars are drawn with vertical extent showing fraction(yvar category | xvar category) and horizontal extent showing fraction in xvar category. Thus the areas of tiles formed represent the frequencies, or more generally totals, for each cross-combination of yvar and xvar.
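The two fractions that define each tile can be inspected with official Stata commands before drawing anything. The following is a minimal sketch using the auto data; the choice of foreign as yvar and rep78 as xvar is for illustration only.

```stata
sysuse auto, clear

* Column percents: the vertical extents, fraction(yvar category | xvar category), times 100
tabulate foreign rep78, column nofreq

* One-way percents of xvar: the horizontal extents, times 100
* (observations with rep78 missing are omitted, as spineplot does by default)
tabulate rep78 if !missing(foreign)
```

Multiplying a column fraction by the corresponding one-way fraction gives the overall fraction in that cross-combination, which is the area of the tile.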
fweights and aweights may be specified.
Remarks
The name "spine plot" is due to Hummel (1996). The term is not yet widespread but appears already to be variously understood. Textbooks and monographs with examples of spine plots and related plots include Friendly (2000), Venables and Ripley (2002), Robbins (2005), Unwin et al. (2006) and Young et al. (2006). Among several papers, Hofmann's (2000) discussion is clear, concise and well-illustrated.
Some literature treats spine plots, as understood here, under the heading of mosaic plots, variously with and without also using the term spine plot. The original definition of spine plots appears to have allowed at most vertical subdivision (e.g. by highlighting) into two categories, but this stipulation is also widely ignored as being unduly restrictive. The Stata implementation here under the name spineplot thus implies a broad interpretation of the term. Conversely, this implementation does not purport to be a general mosaic plot program.
Mosaic plots have been re-invented several times under different names. Hartigan and Kleiner (1981, 1984) introduced, or re-introduced, them into mainstream statistics. Friendly (2002) cites earlier examples, including the work of Georg von Mayr (1877), Karl G. Karsten (1923) and Erwin J. Raisz (1934). Hofmann (2007) discusses a mosaic by Francis A. Walker (1874). Other early examples are those of Willard C. Brinton (1914, quoting earlier work) and Berend G. Escher (1934).
Most implementations of mosaic plots omit axes and numerical scales and convey a recursive subdivision according to what may be several categorical variables by a hierarchy of gaps of various sizes. As the plot here is restricted to two variables, this Stata implementation keeps axes and numerical scales as defaults. The distinction between categories is conveyed by bar boundaries rather than explicit gaps.
A key principle behind any kind of mosaic plot is that a categorical classification of independent variables would yield tiles that align consistently. Thus departures from independence, or relationships between variables, will be shown by failure of alignment.
The restriction to two variables is more apparent than real. Composite variables may be created by cross-combination of two or more categorical variables. The egen functions group() and axis() may be useful for this purpose. axis() is in the egenmore package from SSC, which must be installed before it can be used. Compare also what Wilkinson (2005) calls "region trees" (and his references).
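As a sketch of the composite-variable idea, the official egen function group() can cross-combine two categorical variables into one. The variable heavy, its cut-off of 3000 pounds, and the name origin_heavy are hypothetical choices made for this illustration, and spineplot is assumed to be installed already.

```stata
sysuse auto, clear

* Hypothetical binary variable, for illustration only
gen byte heavy = weight > 3000 if !missing(weight)
label define heavy 0 "light" 1 "heavy"
label values heavy heavy

* Composite variable: one category for each observed cross-combination
egen origin_heavy = group(foreign heavy), label

* rep78 broken down within each combination of origin and weight class
spineplot rep78 origin_heavy
```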
The program works by calculating cumulative frequencies. The plot is then produced by overlaying distinct graphs, each a call to twoway bar, bartype(spanning) for one category of yvar. By default, each bar is shown with blcolor(bg) blw(medium), which should be sufficient to outline each bar distinctly but delicately. By default also, the categories of yvar will be distinguished according to the graph scheme you are using. With the default s2color scheme the effect is reminiscent of tinned fruit salad, which may be fine for exploratory work. For a publishable graph you might want to reach for something more subdued, such as various grey scales.
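The cumulative calculation can be sketched with official commands. This is an illustration of the kind of arithmetic involved, not the program's own code; the names xtotal, grand, ytop, first and xright are invented for the example.

```stata
sysuse auto, clear
drop if missing(foreign, rep78)

* One observation per cross-combination, with its frequency in _freq
contract foreign rep78

egen double xtotal = total(_freq), by(rep78)
egen double grand  = total(_freq)

* Cumulative fraction of yvar within each xvar category: top edge of each tile
bysort rep78 (foreign): gen double ytop = sum(_freq) / xtotal

* Cumulative fraction across xvar categories: right-hand edge of each bar
by rep78: gen byte first = _n == 1
gen double xright = sum(first * xtotal) / grand

list rep78 foreign _freq ytop xright, sepby(rep78)
```

Within each category of xvar the last value of ytop is 1, and the last value of xright is 1, so that the tiles fill the unit square.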
Options bar1() to bar20() are provided to allow overriding the defaults on up to 20 categories, the first, second, etc., shown. The limit of 20 is plucked out of the air as more than any user should really want. The option barall() is available to override the defaults for all bars. Any of the options bar1() to bar20() always overrides barall() for its own category. Thus if you wanted thicker blwidth() on all bars you could specify barall(blwidth(thick)). If you wanted to highlight the first category only you could specify bar1(blwidth(thick)).
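The bar options just described might be used as follows. This is a sketch with the auto data, assuming spineplot is installed; the grey scales gs14 and gs10 are arbitrary choices.

```stata
sysuse auto, clear

* Thicker outlines on all bars
spineplot foreign rep78, barall(blwidth(thick))

* Highlight the first category of yvar only
spineplot foreign rep78, bar1(blwidth(thick))

* Grey scale for all bars, with bar2() overriding barall() for the second category
spineplot foreign rep78, barall(color(gs14)) bar2(color(gs10))
```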
Other defaults include legend(col(1) pos(3)). At least with s2color, a legend on the right implies an approximately square plot region, which can look quite good. A legend is supplied partly because there is no guarantee that all yvar categories will be represented for extreme categories of xvar. However, it will often be possible and tasteful to omit the legend and show categories as axis label text. An example is given under Examples.
Note the possibility of using plotregion(margin(zero)) to place axes alongside the plot region.
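For instance, with the auto data and assuming spineplot is installed, the following sketch removes the margin between the tiles and the axes:

```stata
sysuse auto, clear

* Axes drawn directly alongside the tiles
spineplot foreign rep78, plotregion(margin(zero))
```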
As with scatter plots, a response variable is usually better shown on the y axis. If one variable is binary, it is often better to plot that on the y axis. Naturally, there can be some tension between these suggestions. For example, in the auto data, foreign is arguably a predictor of rep78 rather than vice versa, but I suggest that spineplot foreign rep78 is more congenial than spineplot rep78 foreign.
The user may need to experiment with different sort orders for the categorical variables. egen, axis() may again be useful here.
Options
bar1(twoway_bar_options) ... bar20(twoway_bar_options) allow specification of the appearance of the bars for each category of yvar using options of twoway bar.
barall(twoway_bar_options) allows specification of the appearance of the bars for all categories of yvar using options of twoway bar.
missing specifies that any missing values of either of the variables specified should also be included within their own categories. The default is to omit them.
percent specifies labelling in terms of percents. The default is labelling in terms of fractions.
text(textvar [, marker_label_options]) specifies a variable to be shown as text at the centre of each tile. textvar may be a numeric or string variable. It should contain identical values for all observations in each cross-combination of yvar and xvar. A simple example is the frequency of each cross-combination. To show nothing in particular tiles, use a variable with missing values, either numeric missing or empty strings, for those tiles. A numeric variable with fractional part will typically look best converted to string as (e.g.) string(residual, "%4.3f"). Choice of tile colours so that text is readable is the user's responsibility. text() may also include marker label options tuning the display.
twoway_options refers to options of twoway. Note that by default there are two x axes, axis(1) on top and axis(2) on bottom, and two y axes, axis(1) on right and axis(2) on left.
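As a sketch of text() with a numeric variable converted to string, the following shows Pearson residuals, (observed - expected) / sqrt(expected), under independence at the centre of each tile. The choice of Pearson residuals and the names freq, rowtot, coltot, expected, residual and sresidual are assumptions made for this illustration; spineplot is assumed to be installed.

```stata
sysuse auto, clear
drop if missing(foreign, rep78)

* Observed frequency of each cross-combination, and marginal totals
bysort foreign rep78: gen double freq = _N
bysort foreign: gen double rowtot = _N
bysort rep78: gen double coltot = _N

* Expected frequency under independence, and Pearson residual
gen double expected = rowtot * coltot / _N
gen double residual = (freq - expected) / sqrt(expected)

* String version with a fixed number of decimal places
gen sresidual = string(residual, "%4.3f")

spineplot foreign rep78, text(sresidual) percent
```

Each of the new variables is constant within cross-combinations of foreign and rep78, as text() requires.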
Examples
. sysuse auto
. spineplot foreign rep78
. spineplot foreign rep78, xti(frequency, axis(1)) xla(0(10)60, axis(1)) xmti(1/69, axis(1))
. spineplot rep78 foreign

. set scheme s1color
. bysort foreign rep78: gen freq = _N
. spineplot foreign rep78, text(freq, mlabsize(*1.4)) bar1(color(gs14)) bar2(color(gs10))
. spineplot foreign rep78, text(freq, mlabsize(*1.4)) bar1(color(gs14)) bar2(color(gs10)) legend(off) yla(0.1 "Domestic" 0.9 "Foreign", noticks axis(1))
Author
Nicholas J. Cox, Durham University
n.j.cox@durham.ac.uk
Acknowledgments
Matthias Schonlau, Scott Merryman and Maarten Buis provoked this program through challenging Statalist postings, which re-awakened a long-standing thought that someone, perhaps me, should implement spine plots in Stata. A suggestion from Peter Jepsen led to the text() option. Private emails from Matthias Schonlau and Antony Unwin highlighted different senses of spine plots and the importance of sort order. Vince Wiggins originally told me about the undocumented bartype(spanning) option.
References
Anderson, M.J. 2001. Francis Amasa Walker. In Heyde, C.C. and Seneta, E. (eds) Statisticians of the centuries. New York: Springer, 216-218.
Brinton, W.C. 1914. Graphic methods for presenting facts. New York: Engineering Magazine Company.
Escher, B.G. 1934. De methodes der grafische voorstelling. Amsterdam: Wereldbibliotheek. [first edition 1924]
Friendly, M. 2000. Visualizing categorical data. Cary, NC: SAS Institute.
Friendly, M. 2002. A brief history of the mosaic display. Journal of Computational and Graphical Statistics 11: 89-107.
Hartigan, J.A. and Kleiner, B. 1981. Mosaics for contingency tables. In Eddy, W.F. (ed.) Computer science and statistics: Proceedings of the 13th symposium on the interface. New York: Springer, 268-273.
Hartigan, J.A. and Kleiner, B. 1984. A mosaic of television ratings. American Statistician 38: 32-35.
Hertz, S. 2001. Georg von Mayr. In Heyde, C.C. and Seneta, E. (eds) Statisticians of the centuries. New York: Springer, 219-222.
Hofmann, H. 2000. Exploring categorical data: interactive mosaic plots. Metrika 51: 11-26.
Hofmann, H. 2007. Interview with a centennial chart. Chance 20(2): 26-35.
Hummel, J. 1996. Linked bar charts: analysing categorical data graphically. Computational Statistics 11: 23-33.
Karsten, K.G. 1923. Charts and graphs: An introduction to graphic methods in the control and analysis of statistics. New York: Prentice-Hall.
Mayr, G. von. 1877. Die Gesetzmässigkeit im Gesellschaftsleben. München: Oldenbourg.
Raisz, E.J. 1934. The rectangular statistical cartogram. Geographical Review 24: 292-296.
Robbins, N.B. 2005. Creating more effective graphs. Hoboken, NJ: John Wiley.
Robinson, A.H. 1970. Erwin Josephus Raisz, 1893-1968. Annals, Association of American Geographers 60: 189-193.
Unwin, A., Theus, M. and Hofmann, H. 2006. Graphics of large datasets: Visualizing a million. New York: Springer.
Venables, W.N. and Ripley, B.D. 2002. Modern applied statistics with S. New York: Springer.
Walker, F.A. 1874. Statistical atlas of the United States based on the results of the ninth census 1870. New York: Census Office.
Wilkinson, L. 2005. The grammar of graphics. New York: Springer.
Young, F.W., Valero-Mora, P.M. and Friendly, M. 2006. Visual statistics: Seeing data with interactive graphics. Hoboken, NJ: John Wiley.
Also see
On-line: help for histogram, help for catplot (if installed), help for tabplot (if installed), help for egenmore (if installed), help for vreverse (if installed)