-------------------------------------------------------------------------------
help for spineplot
-------------------------------------------------------------------------------

Spine plots for two-way categorical data

spineplot yvar xvar [weight] [if exp] [in range] [, bar1(twoway_bar_options) ... bar20(twoway_bar_options) barall(twoway_bar_options) missing percent text(textvar [, marker_label_options]) twoway_options]

Description

spineplot produces a spine plot for two-way categorical data. The fractional breakdown of the categories of the first-named variable yvar is shown for each category of the second-named variable xvar. Stacked bars are drawn with vertical extent showing fraction(yvar category | xvar category) and horizontal extent showing fraction in xvar category. Thus the areas of tiles formed represent the frequencies, or more generally totals, for each cross-combination of yvar and xvar.

fweights and aweights may be specified.
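The tile dimensions described above can be reproduced by hand. The following is a minimal sketch, not the program's own code: it uses the auto data, and the variable names xfrac, yfrac, area and first are hypothetical names chosen for illustration.

```stata
sysuse auto, clear
keep if !missing(foreign, rep78)
local total = _N

* horizontal extent: fraction of all observations in each xvar category
bysort rep78: gen double xfrac = _N / `total'

* vertical extent: fraction(yvar category | xvar category)
bysort rep78 foreign: gen double yfrac = _N
bysort rep78: replace yfrac = yfrac / _N

* tile area = width x height = cell frequency / total frequency
gen double area = xfrac * yfrac

* show one row per tile
egen first = tag(rep78 foreign)
list rep78 foreign xfrac yfrac area if first, noobs
```

Multiplying area by the total number of observations recovers the cell frequency, which is the sense in which tile areas represent frequencies.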

Remarks

The name "spine plot" is due to Hummel (1996). The term is not yet widespread but appears already to be variously understood. Textbooks and monographs with examples of spine plots and related plots include Friendly (2000), Venables and Ripley (2002), Robbins (2005), Unwin et al. (2006) and Young et al. (2006). Among several papers, Hofmann's (2000) discussion is clear, concise and well-illustrated.

Some literature treats spine plots, as understood here, under the heading of mosaic plots, variously with and without also using the term spine plot. The original definition of spine plots appears to have allowed at most vertical subdivision (e.g. by highlighting) into two categories, but this stipulation is also widely ignored as being unduly restrictive. The Stata implementation here under the name spineplot thus implies a broad interpretation of the term. Conversely, this implementation does not purport to be a general mosaic plot program.

Mosaic plots have been re-invented several times under different names. Hartigan and Kleiner (1981, 1984) introduced, or re-introduced, them into mainstream statistics. Friendly (2002) cites earlier examples, including the work of Georg von Mayr (1877), Karl G. Karsten (1923) and Erwin J. Raisz (1934). Hofmann (2007) discusses a mosaic by Francis A. Walker (1874). Other early examples are those of Willard C. Brinton (1914, quoting earlier work) and Berend G. Escher (1934).

Most implementations of mosaic plots omit axes and numerical scales and convey a recursive subdivision according to what may be several categorical variables by a hierarchy of gaps of various sizes. As the plot here is restricted to two variables, this Stata implementation keeps axes and numerical scales as defaults. The distinction between categories is conveyed by bar boundaries rather than explicit gaps.

A key principle behind any kind of mosaic plot is that a categorical classification of independent variables would yield tiles that align consistently. Thus departures from independence, or relationships between variables, will be shown by failure of alignment.
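The alignment principle can be seen with a small constructed table. This is a sketch under stated assumptions: the data are invented purely for illustration, the variable names y, x and freq are hypothetical, and spineplot is assumed to be installed already.

```stata
clear
input y x freq
1 1 10
2 1 30
1 2 20
2 2 60
1 3  5
2 3 15
end

* y == 1 makes up 25% of every x category (independence),
* so the boundary between tiles sits at the same height in every bar
spineplot y x [fweight = freq]

* now let y == 1 make up 75% of category x == 3:
* the boundary in that bar no longer lines up with the others
replace freq = 15 if y == 1 & x == 3
replace freq = 5  if y == 2 & x == 3
spineplot y x [fweight = freq]
```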

The restriction to two variables is more apparent than real. Composite variables may be created by cross-combination of two or more categorical variables. The egen functions group() and axis() may be useful for this purpose. axis() is in the egenmore package from SSC and must have been installed previously. Compare also what Wilkinson (2005) calls "region trees" (and his references).

The program works by calculating cumulative frequencies. The plot is then produced by overlaying distinct graphs, each a call to twoway bar, bartype(spanning) for one category of yvar. By default, each bar is shown with blcolor(bg) blw(medium), which should be sufficient to outline each bar distinctly but delicately. By default also, the categories of yvar will be distinguished according to the graph scheme you are using. With the default s2color scheme the effect is reminiscent of tinned fruit salad, which may be fine for exploratory work. For a publishable graph you might want to reach for something more subdued, such as various grey scales.

Options bar1() to bar20() are provided to allow overriding the defaults on up to 20 categories, the first, second, etc., shown. The limit of 20 is plucked out of the air as more than any user should really want. The option barall() is available to override the defaults for all bars. Any bar? option always overrides barall(). Thus if you wanted thicker blwidth() on all bars you could specify barall(blwidth(thick)). If you wanted to highlight the first category only you could specify bar1(blwidth(thick)).

Other defaults include legend(col(1) pos(3)). At least with s2color a legend on the right implies an approximately square plot region which can look quite good. A legend is supplied partly because there is no guarantee that all yvar categories will be represented for extreme categories of xvar. However, it will often be possible and tasteful to omit the legend and show categories as axis label text. An example is given below.

Note the possibility of using plotregion(margin(zero)) to place axes alongside the plot region.

As with scatter plots, a response variable is usually better shown on the y axis. If one variable is binary, it is often better to plot that on the y axis. Naturally, there can be some tension between these suggestions. For example, in the auto data, foreign is arguably a predictor of rep78 rather than vice versa, but I suggest that spineplot foreign rep78 is more congenial than spineplot rep78 foreign.

The user may need to experiment with different sort orders for the categorical variables. egen, axis() may again be useful here.
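As a sketch of the composite-variable idea mentioned in these remarks, two categorical variables can be cross-combined with the official egen function group() and the result used as xvar. The choice of the nlsw88 data and the variable name racecoll are assumptions made for illustration only, and spineplot is assumed to be installed.

```stata
sysuse nlsw88, clear

* cross-combine race and collgrad into a single categorical variable;
* the label option builds value labels from those of the components
egen racecoll = group(race collgrad), label

* union membership broken down by each race x education combination
spineplot union racecoll
```

group() numbers the combinations in the sort order of its arguments, so changing the order of the variables passed to group() is one simple way to experiment with the order of the bars.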

Options

bar1(twoway_bar_options) ... bar20(twoway_bar_options) allow specification of the appearance of the bars for each category of yvar using options of twoway bar.

barall(twoway_bar_options) allows specification of the appearance of the bars for all categories of yvar using options of twoway bar.

missing specifies that any missing values of either of the variables specified should also be included within their own categories. The default is to omit them.

percent specifies labelling in terms of percents. The default is labelling in terms of fractions.

text(textvar [, marker_label_options]) specifies a variable to be shown as text at the centre of each tile. textvar may be a numeric or string variable. It should contain identical values for all observations in each cross-combination of yvar and xvar. A simple example is the frequency of each cross-combination. To show nothing in particular tiles, use a variable with missing values, either numeric missing or empty strings, for those tiles. A numeric variable with fractional part will typically look best converted to string as (e.g.) string(residual, "%4.3f"). Choice of tile colours so that text is readable is the user's responsibility. text() may also include marker label options tuning the display.

twoway_options refers to options of twoway. Note that by default there are two x axes, axis(1) on top and axis(2) on bottom, and two y axes, axis(1) on right and axis(2) on left.
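A few of the options above in use, as a minimal sketch on the auto data. The option values are those suggested elsewhere in this help; spineplot is assumed to be installed.

```stata
sysuse auto, clear

* thicker outline on all bars
spineplot foreign rep78, barall(blwidth(thick))

* thicker outline on the first yvar category only
spineplot foreign rep78, bar1(blwidth(thick))

* keep missing values of rep78 as a category of their own
* and label the axes in percents rather than fractions
spineplot foreign rep78, missing percent

* place the axes alongside the plot region
spineplot foreign rep78, plotregion(margin(zero))
```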

Examples

. sysuse auto
. spineplot foreign rep78
. spineplot foreign rep78, xti(frequency, axis(1)) xla(0(10)60, axis(1)) xmti(1/69, axis(1))
. spineplot rep78 foreign

. set scheme s1color
. bysort foreign rep78: gen freq = _N
. spineplot foreign rep78, text(freq, mlabsize(*1.4)) bar1(color(gs14)) bar2(color(gs10))
. spineplot foreign rep78, text(freq, mlabsize(*1.4)) bar1(color(gs14)) bar2(color(gs10)) legend(off) yla(0.1 "Domestic" 0.9 "Foreign", noticks axis(1))
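The text() option can also show a numeric variable with a fractional part, converted to string first. The sketch below is an illustration under assumptions rather than part of the program: it shows Pearson residuals from the independence model, which is one possible choice of what to display, and the variable names freq, rowtot, coltot, expected, residual and showres are hypothetical.

```stata
sysuse auto, clear
keep if !missing(foreign, rep78)

* observed cell frequencies and marginal totals
bysort foreign rep78: gen double freq = _N
bysort foreign: gen double rowtot = _N
bysort rep78: gen double coltot = _N

* expected frequencies under independence and Pearson residuals
gen double expected = rowtot * coltot / _N
gen double residual = (freq - expected) / sqrt(expected)

* convert to string so that a fixed number of decimals is shown
gen showres = string(residual, "%4.3f")

spineplot foreign rep78, text(showres) bar1(color(gs14)) bar2(color(gs10))
```

Every observation in a given tile carries the same value of showres, as text() requires. Combinations of foreign and rep78 that do not occur in the data have no observations and hence no tile or text.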

Author

Nicholas J. Cox, Durham University
n.j.cox@durham.ac.uk

Acknowledgments

Matthias Schonlau, Scott Merryman and Maarten Buis provoked this program through challenging Statalist postings, which re-awakened a long-standing thought that someone, perhaps me, should implement spine plots in Stata. A suggestion from Peter Jepsen led to the text() option. Private emails from Matthias Schonlau and Antony Unwin highlighted different senses of spine plots and the importance of sort order. Vince Wiggins originally told me about the undocumented bartype(spanning) option.

References

Anderson, M.J. 2001. Francis Amasa Walker. In Heyde, C.C. and Seneta, E. (eds) Statisticians of the centuries. New York: Springer, 216-218.

Brinton, W.C. 1914. Graphic methods for presenting facts. New York: Engineering Magazine Company.

Escher, B.G. 1934. De methodes der grafische voorstelling. Amsterdam: Wereldbibliotheek. [first edition 1924]

Friendly, M. 2000. Visualizing categorical data. Cary, NC: SAS Institute.

Friendly, M. 2002. A brief history of the mosaic display. Journal of Computational and Graphical Statistics 11: 89-107.

Hartigan, J.A. and Kleiner, B. 1981. Mosaics for contingency tables. In Eddy, W.F. (ed.) Computer science and statistics: Proceedings of the 13th symposium on the interface. New York: Springer, 268-273.

Hartigan, J.A. and Kleiner, B. 1984. A mosaic of television ratings. American Statistician 38: 32-35.

Hertz, S. 2001. Georg von Mayr. In Heyde, C.C. and Seneta, E. (eds) Statisticians of the centuries. New York: Springer, 219-222.

Hofmann, H. 2000. Exploring categorical data: interactive mosaic plots. Metrika 51: 11-26.

Hofmann, H. 2007. Interview with a centennial chart. Chance 20(2): 26-35.

Hummel, J. 1996. Linked bar charts: analysing categorical data graphically. Computational Statistics 11: 23-33.

Karsten, K.G. 1923. Charts and graphs: An introduction to graphic methods in the control and analysis of statistics. New York: Prentice-Hall.

Mayr, G. von. 1877. Die Gesetzmässigkeit im Gesellschaftleben. München: Oldenbourg.

Raisz, E.J. 1934. The rectangular statistical cartogram. Geographical Review 24: 292-296.

Robbins, N.B. 2005. Creating more effective graphs. Hoboken, NJ: John Wiley.

Robinson, A.H. 1970. Erwin Josephus Raisz, 1893-1968. Annals, Association of American Geographers 60: 189-193.

Unwin, A., Theus, M. and Hofmann, H. 2006. Graphics of large datasets: Visualizing a million. New York: Springer.

Venables, W.N. and Ripley, B.D. 2002. Modern applied statistics with S. New York: Springer.

Walker, F.A. 1874. Statistical atlas of the United States based on the results of the ninth census 1870. New York: Census Office.

Wilkinson, L. 2005. The grammar of graphics. New York: Springer.

Young, F.W., Valero-Mora, P.M. and Friendly, M. 2006. Visual statistics: Seeing data with interactive graphics. Hoboken, NJ: John Wiley.

Also see

On-line: help for histogram, help for catplot (if installed), help for tabplot (if installed), help for egenmore (if installed), help for vreverse (if installed)