{smcl}
{* 12feb2008/27apr2016}{...}
{hi:help spineplot}{right: ({browse "http://www.stata-journal.com/article.html?article=gr0031_1":SJ16-2: gr0031_1})}
{hline}

{title:Title}

{p2colset 5 18 20 2}{...}
{p2col :{hi:spineplot} {hline 2}}Spineplots for two-way categorical data{p_end}


{title:Syntax}

{p 8 17 2}
{cmd:spineplot}
{it:yvar} {it:xvar}
{ifin}
{weight}
[{cmd:,}
{cmd:bar1(}{it:twoway_bar_options}{cmd:)} ... {cmd:bar20(}{it:twoway_bar_options}{cmd:)}
{cmd:barall(}{it:twoway_bar_options}{cmd:)}
{cmdab:miss:ing}
{cmdab:perc:ent}
{cmd:text(}{it:textvar} [{cmd:,} {it:marker_label_options}]{cmd:)}
{it:twoway_options}]

{p 4 4 2}
{cmd:fweight}s and {cmd:aweight}s may be specified; see {help weight}.


{title:Description}

{p 4 4 2}
{cmd:spineplot} produces a spineplot for two-way categorical data.
The fractional breakdown of the categories of the first-named variable
{it:yvar} is shown for each category of the second-named variable {it:xvar}.
Stacked bars are drawn with vertical extent showing fraction in each
{it:yvar} category given each {it:xvar} category and horizontal extent
showing fraction in {it:xvar} category. Thus the areas of tiles formed
represent the frequencies, or more generally totals, for each
cross-combination of {it:yvar} and {it:xvar}.


{title:Options}

{p 4 8 2}
{cmd:bar1(}{it:twoway_bar_options}{cmd:)} ... {cmd:bar20(}{it:twoway_bar_options}{cmd:)}
allow specification of the appearance of the bars for each category of
{it:yvar} using options of {helpb twoway bar}.

{p 4 8 2}
{cmd:barall(}{it:twoway_bar_options}{cmd:)} allows specification of the
appearance of the bars for all categories of {it:yvar} using options of
{helpb twoway bar}.

{p 4 8 2}
{cmd:missing} specifies that any missing values of either of the variables
specified should also be included within their own categories. The default
is to omit them.

{p 4 8 2}
{cmd:percent} specifies labeling as percentages. The default is labeling as
fractions.
{p 4 8 2}
{cmd:text(}{it:textvar} [{cmd:,} {it:marker_label_options}]{cmd:)}
specifies a variable to be shown as text at the center of each tile.
{it:textvar} may be a numeric or string variable. It should contain
identical values for all observations in each cross-combination of
{it:yvar} and {it:xvar}. A simple example is the frequency of each
cross-combination. To show nothing in particular tiles, use a variable with
missing values (either numeric missing or empty strings) for those tiles.
A numeric variable with fractional part will typically look best converted
to string as, for example,
{cmd:string(}{it:residual}{cmd:, "%4.3f")}.
The user is responsible for choice of tile colors so that text is readable.
{cmd:text()} may also include {it:{help marker_label_options}} for tuning
the display.

{p 4 8 2}
{it:twoway_options} refers to options of {helpb twoway}.
By default there are two x axes, {cmd:axis(1)} on top and {cmd:axis(2)} on
bottom, and two y axes, {cmd:axis(1)} on right and {cmd:axis(2)} on left.


{title:Remarks}

{p 4 4 2}
The name "spineplot" is credited to Hummel (1996). The term is gaining in
popularity but already appears to be differently understood. In the
strictest definition, spineplots are one-dimensional, horizontal stacked bar
charts, but many discussions and implementations allow vertical subdivision
(e.g., by highlighting) into two or possibly more categories. Some
literature treats spineplots, as understood here, under the heading of
mosaic plots, variously with and without also using the term spineplot.
This Stata implementation under the name {cmd:spineplot} thus implies a
broad interpretation of the term. Conversely, the implementation here does
not purport to be a general mosaic plot program.
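
{p 4 4 2}
As a practical aside on the {cmd:text()} option, the following is a minimal
sketch, not one of the author's examples, of showing conditional fractions
as tile text. It assumes the auto data shipped with Stata; the variable
names {cmd:freq}, {cmd:frac}, and {cmd:sfrac} are hypothetical. The fraction
of each {it:yvar} category within each {it:xvar} category is computed and
then converted to string, as suggested under {cmd:text()}, so that a fixed
number of decimal places is shown.

{p 4 8 2}{cmd:. sysuse auto, clear}{p_end}
{p 4 8 2}{cmd:. bysort rep78 foreign: gen freq = _N}{p_end}
{p 4 8 2}{cmd:. bysort rep78: gen frac = freq / _N}{p_end}
{p 4 8 2}{cmd:. gen sfrac = string(frac, "%4.3f")}{p_end}
{p 4 8 2}{cmd:. spineplot foreign rep78, text(sfrac)}{p_end}

{p 4 4 2}
Each of the new variables is constant within every cross-combination of
{cmd:foreign} and {cmd:rep78}, which is what {cmd:text()} expects.
Observations with missing {cmd:rep78} are omitted by default.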

{p 4 4 2}
Textbooks and monographs with examples of spineplots and related plots
include
Schmid (1954),
Cole (1959),
Lockwood (1969),
Edwards (1972, 1992),
Ehrenberg (1975),
Schmid and Schmid (1979),
Altman (1991),
Friendly (2000),
Venables and Ripley (2002),
Gotelli and Ellison (2004, 2013),
Robbins (2005, 2013),
Unwin, Theus, and Hofmann (2006),
Young, Valero-Mora, and Friendly (2006),
Cook and Swayne (2007),
Unwin (2015),
and
Friendly and Meyer (2016).
Among several papers, Hofmann's (2000) discussion is clear, concise, and
well illustrated.

{p 4 4 2}
Mosaic plots have been reinvented several times under different names.
Hartigan and Kleiner (1981, 1984) introduced, or reintroduced, them into
mainstream statistics. Friendly (2002) cites earlier examples, including
the work of Georg von Mayr (1877), Karl G. Karsten (1923),
Erwin J. Raisz (1934), and Thomas W. Birch (1949).
Hofmann (2007) discusses a mosaic by Francis A. Walker (1874).
Other early examples are those of
Willard C. Brinton (1914, quoting earlier work; 1939),
Berend G. Escher (1924),
and
Hans Zeisel (1947, 1985).

{p 4 4 2}
Most implementations of mosaic plots in other software omit axes and
numerical scales and convey a recursive subdivision according to what may
be several categorical variables by a hierarchy of gaps of various sizes.
As the graphs produced by {cmd:spineplot} are restricted to two variables,
this Stata implementation keeps axes and numerical scales as defaults.
The distinction between categories is conveyed by bar boundaries rather
than explicit gaps.

{p 4 4 2}
A key principle behind any kind of mosaic plot is that a categorical
classification of independent variables would yield tiles that align
consistently. Thus departures from independence, or relationships between
variables, will be shown by failure of alignment.

{p 4 4 2}
The restriction to two variables is more apparent than real. Composite
variables may be created by cross-combination of two or more categorical
variables.
The {helpb egen} functions {cmd:group()} and {cmd:axis()} may be useful for
this purpose. {cmd:axis()} is in the {cmd:egenmore} package from the
Statistical Software Components archive and must have been installed
previously. Compare also what Hofmann (2001) calls "double-decker plots"
(for binary responses) and what Wilkinson (2005) calls "region trees".

{p 4 4 2}
The program works by calculating cumulative frequencies. The plot is then
produced by overlaying distinct graphs, each being a call to
{cmd:twoway bar, bartype(spanning)} for one category of {it:yvar}.
By default, each bar is shown with {cmd:blcolor(bg) blw(medium)}, which
should be sufficient to outline each bar distinctly but delicately.
By default also, the categories of {it:yvar} will be distinguished according
to the graph scheme you are using. With the default {cmd:s2color} scheme,
the effect is reminiscent of canned fruit salad (which may be fine for
exploratory work). For a publishable graph, you might want to use something
more subdued, such as various gray scales or different intensities.

{p 4 4 2}
Options {cmd:bar1()} to {cmd:bar20()} are provided to allow overriding the
defaults on up to 20 categories, the first, second, etc., shown. The limit
of 20 is plucked out of the air as more than any user should really want.
The option {cmd:barall()} is available to override the defaults for all
bars. Any {cmd:bar}{it:#}{cmd:()} option always overrides {cmd:barall()}.
Thus, if you wanted thicker {cmd:blwidth()} on all bars, you could specify
{cmd:barall(blwidth(thick))}. If you wanted to highlight the first category
only, you could specify {cmd:bar1(blwidth(thick))} or a particular color.

{p 4 4 2}
Other defaults include {cmd:legend(col(1) pos(3))}. At least with
{cmd:s2color}, a legend on the right implies an approximately square plot
region, which can look quite good.
A legend is supplied partly because there is no guarantee that all
{it:yvar} categories will be represented for extreme categories of
{it:xvar}. However, it will often be possible and tasteful to omit the
legend and show categories as axis label text. An example is given below.

{p 4 4 2}
Note the possibility of using {cmd:plotregion(margin(zero))} to place axes
alongside the plot region.

{p 4 4 2}
As with scatterplots, a response variable is usually better shown on the
{it:y} axis. If one variable is binary, it is often better to plot that on
the {it:y} axis. Naturally, there can be some tension between these
suggestions. For example, in the auto data, {cmd:foreign} is arguably a
predictor of {cmd:rep78} rather than vice versa, but I suggest that
{cmd:spineplot foreign rep78} is more congenial than
{cmd:spineplot rep78 foreign}.

{p 4 4 2}
You may need to experiment with different sort orders for the categorical
variables. {cmd:egen, axis()} may be useful here.

{p 4 4 2}
The {it:x} axis labels on the bottom axis ({cmd:axis(2)}) are placed below
the middle of each column. As a convenience to users wishing to override
the defaults, the specification is saved as {cmd:r(catlabels)}, so that the
command may be repeated with revised positions and/or text.
Type {cmd:return list} to see the specification.


{title:Examples}

{p 4 8 2}{cmd:. sysuse auto}{p_end}
{p 4 8 2}{cmd:. spineplot foreign rep78}{p_end}
{p 4 8 2}{cmd:. spineplot foreign rep78, xti(frequency, axis(1)) xla(0(10)60, axis(1)) xmti(1/69, axis(1))}{p_end}
{p 4 8 2}{cmd:. spineplot rep78 foreign}{p_end}

{p 4 8 2}{cmd:. set scheme s1color}{p_end}
{p 4 8 2}{cmd:. bysort foreign rep78: gen freq = _N}{p_end}
{p 4 8 2}{cmd:. spineplot foreign rep78, text(freq, mlabsize(*1.4)) bar1(color(gs14)) bar2(color(gs10))}{p_end}
{p 4 8 2}{cmd:. 
spineplot foreign rep78, text(freq, mlabsize(*1.4)) bar1(color(gs14)) bar2(color(gs10)) legend(off) yla(0.1 "Domestic" 0.9 "Foreign", noticks axis(1))}{p_end}


{title:References}

{p 4 8 2}
Altman, D. G. 1991.
{it:Practical Statistics for Medical Research.}
London: Chapman & Hall.

{p 4 8 2}
Anderson, M. J. 2001. Francis Amasa Walker.
In {it:Statisticians of the Centuries}, ed. C. C. Heyde and E. Seneta, 216-218.
New York: Springer.

{p 4 8 2}
Anonymous. 1967. In memoriam: Prof. Dr. B. G. Escher.
{it:Geologie en Mijnbouw} 46: 417-422.

{p 4 8 2}
Birch, T. W. 1949.
{it:Maps: Topographical and Statistical.}
London: Oxford University Press.

{p 4 8 2}
Brinton, W. C. 1914.
{it:Graphic Methods for Presenting Facts}.
New York: Engineering Magazine Company.

{p 4 8 2}
Brinton, W. C. 1939.
{it:Graphic Presentation.}
New York: Brinton Associates.
{browse "http://www.archive.org/stream/graphicpresentat00brinrich":http://www.archive.org/stream/graphicpresentat00brinrich}

{p 4 8 2}
Cole, J. P. 1959.
{it:Geography of World Affairs.}
Harmondsworth: Penguin.

{p 4 8 2}
Cook, D. and D. F. Swayne. 2007.
{it:Interactive and Dynamic Graphics for Data Analysis: With R and GGobi}.
New York: Springer.

{p 4 8 2}
Edwards, A. W. F. 1972, revised 1992.
{it:Likelihood: An Account of the Statistical Concept of Likelihood and its Application to Statistical Inference.}
London: Cambridge University Press;
Baltimore: Johns Hopkins University Press.
{* NJC confirms London as place of first publication}

{p 4 8 2}
Ehrenberg, A. S. C. 1975.
{it:Data Reduction: Analysing and Interpreting Statistical Data.}
London: John Wiley.
{* NJC confirms London as place of first publication}

{p 4 8 2}
Escher, B. G. 1924.
{it:De Methodes der Grafische Voorstelling}.
Amsterdam: Wereldbibliotheek.

{p 4 8 2}
Escher, B. G. 1934.
{it:De Methodes der Grafische Voorstelling}. 2nd ed.
Amsterdam: Wereldbibliotheek.

{p 4 8 2}
Friendly, M. 2000.
{it:Visualizing Categorical Data}.
Cary, NC: SAS Institute.

{p 4 8 2}
Friendly, M. 2002.
A brief history of the mosaic display.
{it:Journal of Computational and Graphical Statistics} 11: 89-107.

{p 4 8 2}
Friendly, M. and D. Meyer. 2016.
{it:Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data.}
Boca Raton, FL: CRC Press.

{p 4 8 2}
Gotelli, N. J. and A. M. Ellison. 2004 (2nd edition 2013).
{it:A Primer of Ecological Statistics.}
Sunderland, MA: Sinauer.

{p 4 8 2}
Hartigan, J. A. and B. Kleiner. 1981. Mosaics for contingency tables.
In {it:Computer Science and Statistics: Proceedings of the 13th Symposium}
{it:on the Interface}, ed. W. F. Eddy, 268-273.
New York: Springer.

{p 4 8 2}
Hartigan, J. A. and B. Kleiner. 1984. A mosaic of television ratings.
{it:American Statistician} 38: 32-35.

{p 4 8 2}
Hertz, S. 2001. Georg von Mayr.
In {it:Statisticians of the Centuries}, ed. C. C. Heyde and E. Seneta, 219-222.
New York: Springer.

{p 4 8 2}
Hofmann, H. 2000. Exploring categorical data: Interactive mosaic plots.
{it:Metrika} 51: 11-26.

{p 4 8 2}
Hofmann, H. 2001. Generalized odds ratios for visual modeling.
{it:Journal of Computational and Graphical Statistics} 10: 628-640.

{p 4 8 2}
Hofmann, H. 2007. Interview with a centennial chart.
{it:Chance} 20(2): 26-35.

{p 4 8 2}
Hummel, J. 1996. Linked bar charts: Analyzing categorical data graphically.
{it:Computational Statistics} 11: 23-33.

{p 4 8 2}
Karsten, K. G. 1923.
{it:Charts and Graphs: An Introduction to Graphic Methods in the Control and Analysis of Statistics.}
New York: Prentice-Hall.

{p 4 8 2}
Lockwood, A. 1969.
{it:Diagrams: A Visual Survey of Graphs, Maps, Charts and Diagrams for the Graphic Designer.}
London: Studio Vista.

{p 4 8 2}
Raisz, E. J. 1934. The rectangular statistical cartogram.
{it:Geographical Review} 24: 292-296.

{p 4 8 2}
Robbins, N. B. 2005 (reissued 2013).
{it:Creating More Effective Graphs}.
Hoboken, NJ: Wiley; Wayne, NJ: Chart House.

{p 4 8 2}
Robinson, A. H. 1970. Erwin Josephus Raisz, 1893-1968.
{it:Annals of the Association of American Geographers} 60: 189-193.

{p 4 8 2}
Schmid, C. F. 1954.
{it:Handbook of Graphic Presentation.}
New York: Ronald Press.

{p 4 8 2}
Schmid, C. F. and S. E. Schmid. 1979.
{it:Handbook of Graphic Presentation.}
New York: John Wiley.

{p 4 8 2}
Sills, D. L. 1992. In Memoriam: Hans Zeisel, 1905-1992.
{it:Public Opinion Quarterly} 56: 536-537.

{p 4 8 2}
Unwin, A. 2015.
{it:Graphical Data Analysis with R.}
Boca Raton, FL: CRC Press.

{p 4 8 2}
Unwin, A., M. Theus, and H. Hofmann. 2006.
{it:Graphics of Large Datasets: Visualizing a Million}.
New York: Springer.

{p 4 8 2}
Venables, W. N. and B. D. Ripley. 2002.
{it:Modern Applied Statistics with S}.
New York: Springer.

{p 4 8 2}
von Mayr, G. 1877.
{it:Die Gesetzm{c a:}ssigkeit im Gesellschaftsleben}.
M{c u:}nchen: Oldenbourg.

{p 4 8 2}
Walker, F. A. 1874.
{it:Statistical Atlas of the United States Based on the Results of the Ninth Census 1870.}
New York: Census Office.

{p 4 8 2}
Wilkinson, L. 2005.
{it:The Grammar of Graphics.} 2nd ed.
New York: Springer.

{p 4 8 2}
Young, F. W., P. M. Valero-Mora, and M. Friendly. 2006.
{it:Visual Statistics: Seeing Data with Dynamic Interactive Graphics}.
Hoboken, NJ: Wiley.

{p 4 8 2}
Zeisel, H. 1947.
{it:Say It with Figures}.
New York: Harper.

{p 4 8 2}
Zeisel, H. 1985.
{it:Say It with Figures}. 6th ed.
New York: Harper & Row.


{title:Stored results}

{synoptset 20 tabbed}{...}
{p2col 5 20 24 2: Macros}{p_end}
{synopt:{cmd:r(catlabels)}}specification of x axis labels, axis 2{p_end}
{p2colreset}{...}


{title:Acknowledgments}

{p 4 4 2}
Matthias Schonlau, Scott Merryman, and Maarten Buis provoked the writing of
this program through challenging Statalist postings, which reawakened a
long-standing thought that someone, perhaps me, should implement spineplots
in Stata. A suggestion from Peter Jepsen led to the {cmd:text()} option.
Private emails from Matthias Schonlau and Antony Unwin highlighted different
senses of spineplots and the importance of sort order.
Antony suggested standardizing on "spineplot" rather than "spine plot".
Maarten verified for me that the spineplot in my copy of Escher (1934) also
appears in Escher (1924). Vince Wiggins originally told me about the
undocumented {cmd:bartype(spanning)} option.

{p 4 4 2}
Dimitriy V. Masterov prompted revision in 2016 to improve handling of
{it:x} axis labels.


{title:Author}

{p 4 4 2}Nicholas J. Cox, Durham University{break}
n.j.cox@durham.ac.uk


{title:Also see}

{psee}
Article: {it:Stata Journal}, volume 8, number 1: {browse "http://www.stata-journal.com/article.html?article=gr0031":gr0031}{break}

{psee}
Online: {manhelp histogram R},
{helpb catplot} (if installed),
{helpb tabplot} (if installed),
{helpb egenmore} (if installed),
{helpb vreverse} (if installed)
{p_end}