-------------------------------------------------------------------------------
help for distplot
-------------------------------------------------------------------------------

Distribution function plots

    distplot plottype varname [weight] [if exp] [in range]
        [, by(byvar) {frequency|midpoint} missing reverse
        trscale(transformation_syntax) graph_options]

    distplot plottype varlist [weight] [if exp] [in range]
        [, {frequency|midpoint} reverse
        trscale(transformation_syntax) graph_options]

Description

distplot produces a plot of the cumulative distribution function(s) for the variables in varlist. This shows the proportion (or if desired the frequency) of values less than or equal to each value.

The plot may be one of eight twoway types, namely area, bar, connected, dot, dropline, line, scatter or spike. The plottype must be specified.

With the reverse option, distplot produces a plot of the reverse cumulative probabilities (or frequencies), also known as (or, for frequencies, a multiple of) the complementary distribution, reliability, survival or survivor function. This shows the proportion (or if desired the frequency) of values greater than each value.

fweights and aweights may be specified.
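As a rough cross-check on what is being plotted, the cumulative and reverse cumulative probabilities for a single variable can be approximated with official Stata's cumul. This is a sketch only, not necessarily how distplot is implemented; it assumes the auto data shipped with Stata, and the variable names cdf and revcdf are hypothetical.

```stata
* A sketch only: approximate the quantities distplot plots for one variable,
* using official Stata's cumul on the auto data shipped with Stata.
* The variable names cdf and revcdf are hypothetical.
sysuse auto, clear
cumul mpg, generate(cdf) equal      // proportion of values <= each value
generate double revcdf = 1 - cdf    // proportion of values > each value
sort mpg
scatter cdf revcdf mpg
```

The equal option gives tied values the same cumulative probability, which matches the "less than or equal to" definition above.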

Options

by() specifies that calculations are to be carried out separately for each distinct value of a single variable byvar. by() is only allowed with a single varname.

frequency specifies calculation of cumulative frequency rather than cumulative probability.

midpoint specifies the use of midpoints of cumulative probability for each distinct value. This is especially appropriate for showing distributions of graded (ordinal) data with a relatively small number of categories. For more explanation and examples, see the Appendix below.

frequency and midpoint may not be combined.

missing, used only with by(), permits the use of non-missing values of varname corresponding to missing values for the variable named by by(). The default is to ignore such values.

trscale() specifies the use of an alternative transformed scale for cumulative probabilities (or frequencies) on the graph. Stata syntax should be used, with @ as placeholder for untransformed values. To show probabilities as percents, specify trscale(100 * @). To show probabilities on an inverse normal scale, specify trscale(invnorm(@)); on a logit scale, specify trscale(logit(@)); on a folded root scale, specify trscale(sqrt(@) - sqrt(1 - @)); on a loglog scale, specify trscale(-log(-log(@))); on a cloglog scale, specify trscale(log(-log(1 - @))). Tools to make associated labels and ticks easier are available on SSC: see ssc desc mylabels.

graph_options refers to options of graph appropriate to the plottype specified.
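To get a feel for the scales that trscale() can produce, the transformations listed above can be tabulated for a few probabilities. This is a sketch only: the variable p stands in for the @ placeholder, and all variable names are hypothetical.

```stata
* A sketch only: tabulate some trscale() transformations for p = 0.1, ..., 0.9.
* The variable p plays the role of the @ placeholder; names are hypothetical.
clear
set obs 9
generate double p         = _n / 10
generate double t_percent = 100 * p
generate double t_logit   = logit(p)
generate double t_froot   = sqrt(p) - sqrt(1 - p)
generate double t_loglog  = -log(-log(p))
generate double t_cloglog = log(-log(1 - p))
format t_* %8.3f
list, noobs
```

The logit and folded root scales are 0 at p = 0.5 and symmetric about it, whereas the loglog and cloglog scales are not. The logit, loglog and cloglog transforms are undefined at one or both of 0 and 1.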

Examples

    . distplot scatter mpg
    . distplot line mpg, by(foreign) clp(l _)
    . distplot connected length width height

To sample all possible plottypes:

    . foreach t in area bar connected dot dropline line scatter spike {
    .     distplot `t' mpg, by(foreign)
    . }

Author

Nicholas J. Cox, University of Durham
n.j.cox@durham.ac.uk

Acknowledgments

Elizabeth Allred, Ronan Conroy and Roger Harbord made helpful comments during development of earlier versions of this or related programs.

Also see

On-line: help for graph, cumul, quantile, mylabels (if installed)

Manual: [R] cumul, [R] diagnostic plots

Appendix: the midpoint option and graded data

The cumulative probability P is defined under the midpoint option as

    SUM counts in categories below + (1/2) count in this category
    -------------------------------------------------------------.
                    SUM counts in all categories

With terminology from Tukey (1977, pp.496-497), this could be called a `split fraction below'. It is also a `ridit' as defined by Bross (1958): see also Fleiss (1981, pp.150-7) or Flora (1988). Yet again, it is also the mid-distribution function of Parzen (1993, p.3295).

The numerator is a `split count'. Using this numerator, rather than

    SUM counts in categories below

or

    SUM counts in categories below + count in this category,

means that more use is made of the information in the data. Either alternative would always mean that some probabilities are identically 0 or 1, which tells us nothing about the data. In addition, there are fewer problems in showing the cumulative distribution on any transformed scale (e.g. logit) for which the transform of 0 or 1 is not plottable. Using this approach for graded data was suggested by Cox (2001).

A plot of the complement of this cumulative probability, 1 - P, may be obtained through the reverse option.

Further information on working with counted fractions and folded transformations for probability scales is available in Tukey (1960, 1961, 1977), Atkinson (1985), Cox and Snell (1989) and Emerson (1991). Some of the possible transformations appear as link functions in the literature on generalized linear models (e.g. McCullagh and Nelder 1989; Aitkin et al. 1989).
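The definition of P can be checked by hand. The following is a sketch only, not the code distplot itself uses: it takes the counts for first-year male students from Example 1 below (policies A to D coded 1 to 4), and the variable names cumfreq, P and revP are hypothetical.

```stata
* A sketch only: midpoint cumulative probabilities computed by hand.
* Counts are those for first-year male students in Example 1 (policies A-D);
* the variable names cumfreq, P and revP are hypothetical.
clear
input policy freq
1 175
2 116
3 131
4  17
end
generate double cumfreq = sum(freq)                      // counts in this category and below
generate double P = (cumfreq - freq / 2) / cumfreq[_N]   // split fraction below
generate double revP = 1 - P                             // complement, as with reverse
list, noobs
```

With these counts P is about 0.199, 0.531, 0.812 and 0.981, so no category is plotted at exactly 0 or 1 and every value can be shown on a logit scale.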

Example 1

Aitkin et al. (1989, p.242) reported data from a survey of student opinion on the Vietnam War taken at the University of North Carolina in Chapel Hill in May 1967. Students were classified by sex, year of study and the policy they supported, given choices of

A The US should defeat the power of North Vietnam by widespread bombing of its industries, ports and harbours and by land invasion.

B The US should follow the present policy in Vietnam.

C The US should de-escalate its military activity, stop bombing North Vietnam, and intensify its efforts to begin negotiation.

D The US should withdraw its military forces from Vietnam immediately.

(They also report response rates (p.243), averaging 26% for males and 17% for females.)

Suppose that, underneath the labels below, the value labels of sex are also called sex.

           sex       year   policy   freq
   1.     male          1        A    175
   2.     male          1        B    116
   3.     male          1        C    131
   4.     male          1        D     17
   5.     male          2        A    160
   6.     male          2        B    126
   7.     male          2        C    135
   8.     male          2        D     21
   9.     male          3        A    132
  10.     male          3        B    120
  11.     male          3        C    154
  12.     male          3        D     29
  13.     male          4        A    145
  14.     male          4        B     95
  15.     male          4        C    185
  16.     male          4        D     44
  17.     male   Graduate        A    118
  18.     male   Graduate        B    176
  19.     male   Graduate        C    345
  20.     male   Graduate        D    141
  21.   female          1        A     13
  22.   female          1        B     19
  23.   female          1        C     40
  24.   female          1        D      5
  25.   female          2        A      5
  26.   female          2        B      9
  27.   female          2        C     33
  28.   female          2        D      3
  29.   female          3        A     22
  30.   female          3        B     29
  31.   female          3        C    110
  32.   female          3        D      6
  33.   female          4        A     12
  34.   female          4        B     21
  35.   female          4        C     58
  36.   female          4        D     10
  37.   female   Graduate        A     19
  38.   female   Graduate        B     27
  39.   female   Graduate        C    128
  40.   female   Graduate        D     13

    . distplot connected policy [w=freq] if sex=="male":sex, mid by(year) xla(, valuelabel)
    . distplot connected policy [w=freq] if sex=="female":sex, mid by(year) xla(, valuelabel)

Example 2

Fienberg (1980, pp.54-55) reports data from Duncan, Schuman and Duncan (1973) from 1959 and 1971 surveys of a large American city asking "Are the radio and TV networks doing a good job, just a fair job, or a poor job?". Suppose that, underneath the labels below, opinion runs 1/3.

             group   opinion   freq
   1.   1959 Black      Good     81
   2.   1959 Black      Fair     23
   3.   1959 Black      Poor      4
   4.   1959 White      Good    325
   5.   1959 White      Fair    253
   6.   1959 White      Poor     54
   7.   1971 Black      Good    224
   8.   1971 Black      Fair    144
   9.   1971 Black      Poor     24
  10.   1971 White      Good    600
  11.   1971 White      Fair    636
  12.   1971 White      Poor    158

    . tab group opinion [w=freq], row
    . mylabels 20(10)90 95 98 99, myscale(logit(@/100)) local(myla)
    . distplot connected opinion [w=freq], mid by(group) trscale(logit(@)) xla(1/3, valuelabel) yla(`myla', ang(h)) ytitle(Percent)

This shows a clear shift of opinion towards Poor from 1959 to 1971, and a narrowing gap between Black and White.

Example 3

Clogg and Shihadeh (1994, p.156) give data from the 1988 General Social Survey on answers to the question "When a marriage is troubled and unhappy, do you think it is generally better for the children if the couple stays together or gets divorced?". Responses "much better to divorce", "better to divorce", "don't know", "worse to divorce" and "much worse to divorce" were coded here as 1/5 with shorter value labels.

           sex       opinion   freq
   1.     male   much better     84
   2.     male        better    205
   3.     male    don't know    135
   4.     male         worse    121
   5.     male    much worse     56
   6.   female   much better    154
   7.   female        better    330
   8.   female    don't know    178
   9.   female         worse     72
  10.   female    much worse     49

It is not clear that the "don't know"s belong in the middle of the scale. The point can be explored by graphs with and without those values. Either way, there is a distinct separation between males and females, and a logit scale gives a more nearly linear pattern.

    . distplot connected opinion [w=freq], mid by(sex) xla(, valuelabel) xsc(r(0.7,5.3))
    . mylabels 2 5 10(10)90 95 98, myscale(logit(@/100)) local(myla)
    . distplot connected opinion [w=freq], mid by(sex) xla(, valuelabel) xsc(r(0.7,5.3)) trscale(logit(@)) yla(`myla', ang(h)) ytitle(Percent)
    . egen opinion2 = group(opinion) if opinion != 3, label
    . distplot connected opinion2 [w=freq], mid by(sex) xla(, valuelabel) xsc(r(0.7,4.3)) trscale(logit(@)) yla(`myla', ang(h)) ytitle(Percent)

Example 4

Knoke and Burke (1980, p.68) gave data from the 1972 General Social Survey on church attendance. Suppose that, underneath the labels below, attend runs 1/3.

                     group   attend   freq
   1.   young non-Catholic      low    322
   2.   young non-Catholic   medium    122
   3.   young non-Catholic     high    141
   4.     old non-Catholic      low    250
   5.     old non-Catholic   medium    152
   6.     old non-Catholic     high    194
   7.       young Catholic      low     88
   8.       young Catholic   medium     45
   9.       young Catholic     high    106
  10.         old Catholic      low     28
  11.         old Catholic   medium     24
  12.         old Catholic     high    119

The reverse option ensures that higher attendance groups plot higher on the graph. There are clear age and denomination effects and an indication of an interaction between the two.

    . mylabels 0.05 0.1(0.2)0.9 0.95, myscale(logit(@)) local(myla)
    . distplot connected attend [w=freq], mid by(group) trscale(logit(@)) reverse xla(1/3, valuelabel) yla(`myla', ang(h))

Example 5

Box, Hunter and Hunter (1978, pp.145-9) gave data from five hospitals on the degree of restoration (no improvement, partial functional restoration, complete functional restoration) of certain joints impaired by disease, as effected by a certain surgical procedure. (It is not clear whether these data are real.) Hospital E is a referral hospital. Box et al. carry out chi-square analyses, focusing on the difference between Hospital E and the others. Suppose that, underneath the labels below, restore runs 1/3.

        hospital    restore   freq
   1.          A       none     13
   2.          B       none      5
   3.          C       none      8
   4.          D       none     21
   5.          E       none     43
   6.          A    partial     18
   7.          B    partial     10
   8.          C    partial     36
   9.          D    partial     56
  10.          E    partial     29
  11.          A   complete     16
  12.          B   complete     16
  13.          C   complete     35
  14.          D   complete     51
  15.          E   complete     10

    . mylabels 5 10(20)90 95, myscale(logit(@/100)) local(myla)
    . distplot connected restore [w=freq], mid by(hospital) trscale(logit(@)) xla(1/3, valuelabel) xsc(r(0.9,3.1)) yla(`myla', ang(h)) ytitle(Percent)

References

Aitkin, M., Anderson, D., Francis, B. and Hinde, J. 1989. Statistical modelling in GLIM. Oxford: Oxford University Press.

Atkinson, A.C. 1985. Plots, transformations, and regression. Oxford: Oxford University Press.

Box, G.E.P., Hunter, W.G. and Hunter, J.S. 1978. Statistics for experimenters: an introduction to design, data analysis, and model building. New York: John Wiley.

Bross, I.D.J. 1958. How to use ridit analysis. Biometrics 14: 38-58.

Clogg, C.C. and Shihadeh, E. 1994. Statistical models for ordinal variables. Thousand Oaks, CA: Sage.

Cox, D.R. and Snell, E.J. 1989. Analysis of binary data. London: Chapman and Hall.

Cox, N.J. 2001. Plotting graded data: a Tukey-ish approach. Presentation to UK Stata users meeting, Royal Statistical Society, London, 14-15 May. http://www.stata.com/support/meeting/7uk/cox1.pdf

Duncan, O.D., Schuman, H. and Duncan, B. 1973. Social change in a metropolitan community. New York: Russell Sage Foundation.

Emerson, J.D. 1991. Introduction to transformation. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds) Fundamentals of exploratory analysis of variance. New York: John Wiley, 365-400.

Fienberg, S.E. 1980. The analysis of cross-classified categorical data. Cambridge, MA: MIT Press.

Fleiss, J.L. 1981. Statistical methods for rates and proportions. New York: John Wiley.

Flora, J.D. 1988. Ridit analysis. In Kotz, S. and Johnson, N.L. (eds) Encyclopedia of statistical sciences. New York: John Wiley, 8, 136-139.

Knoke, D. and Burke, P.J. 1980. Log-linear models. Beverly Hills, CA: Sage.

McCullagh, P. and Nelder, J.A. 1989. Generalized linear models. London: Chapman and Hall.

Parzen, E. 1993. Change PP plot and continuous sample quantile function. Communications in Statistics, Theory and Methods 22: 3287-3304.

Tukey, J.W. 1960. The practical relationship between the common transformations of percentages or fractions and of amounts. Reprinted in Mallows, C.L. (ed.) 1990. The collected works of John W. Tukey. Volume VI: More mathematical. Pacific Grove, CA: Wadsworth & Brooks-Cole, 211-219.

Tukey, J.W. 1961. Data analysis and behavioral science or learning to bear the quantitative man's burden by shunning badmandments. Reprinted in Jones, L.V. (ed.) 1986. The collected works of John W. Tukey. Volume III: Philosophy and principles of data analysis: 1949-1964. Monterey, CA: Wadsworth & Brooks-Cole, 187-389.

Tukey, J.W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.