Title
hangroot -- Hanging rootogram or suspended rootogram comparing an empirical distribution to a theoretcal distribution
Syntax
Stand-alone
hangroot varname [if] [in] [fweight] [, dist(name) general_opts {continuous_opts | discrete_opts} ]
post-estimation command
hangroot [, general_opts {continuous_opts | discrete_opts} ]
The post-estimation syntax is available after estimating a model with or without covariates using one of the commands listed in this table.
general_opts description ------------------------------------------------------------------- dist(name) specifies theoretical distribution par(numlist) specifies the parameters at which the theoretical distribution is to be fixed These parameters will be estimated if this option is not specified suspended specifies that a suspended rootogram is to be drawn rather than a hanging rootogram notheoretical supresses the disply of the theoretical distribution ci draw confidence intervals level(#) sets confidence level to # sims(varlist) overlays the empirical distributions of all variables in varlist spike draw the empirical distribution as spikes, the default bar draw the empirical distribution as bars ninter(#) governs the smoothness of the theoretical distribution if hangroot is used after an estimation command with covariates mainopt(graph_options) options governing the look of the empirical distribution theoropt(graph_options) options governing the look of the theoretical distribution ciopt(graph_options) options governing the look of the confidence interval simsopt(graph_options) options governing the look of the empirical distributions specified in sims() jittersims(integer) horizontally jitter the marker positions for th > e empircal distribution specified in sims() using random noise jitterseed() random number seed for jittersims() plot(plot) add other plots to the graph other options -------------------------------------------------------------------
continuous_opts description ------------------------------------------------------------------- bin(#) set number of bins to # width(#) set width of bins to # start(#) set lower limit of first bin to # -------------------------------------------------------------------
discrete_opts description ------------------------------------------------------------------- discrete specify that the data are discrete width(#) set width of bins to # start(#) set theoretical minimum value to # -------------------------------------------------------------------
Description
hangroot draws a hanging rootogram or suspended rootogram (Tukey 1965, 1972, 1977) and (Tukey and Wilk 1965) (Also see: (Wainer 1974) and (Friendly 2000)) comparing the empirical distribution of varname to a theoretical distribution, as specified in the dist option. Both are an alternative for histogram with the theoretical density function plotted on top. The hanging rootogram differs from a histogram in two ways:
1) The spikes or bars "hang" from the theoretical distribution instead of "standing" on the x-axis. The deviations are now shown as deviations from a horzontal line (y=0) instead of deviations from a curve (the density function). This makes it easier to spot patterns in the deviations.
2) Instead of showing the freqencies it shows the square root of the frequencies. This way the sampling variation of the length of the spikes or bars is stabelized. These lengths are counts of the number of observations that fall within each bin, and larger counts tend to have larger sampling variation than smaller counts, making it harder to compare the deviations across bins. By taking the square root, the sampling variations tends to be approximately equal across bins, facilitating the comparison across bins. Moreover, this tends to make deviations in the tails, where the counts are small, more visible.
The aim of the hanging rootogram and the suspended rootogram is to compare an empirical distribution to a theortical distribution. The key part of the graph that displays this information are the deviations of the bars or spikes in the hanging rootogram from the line y=0, as these are the residuals of the empirical distribution fromt the theoretical distribution. So, a more direct way of achieving the goal of these graphs is to directly display these residuals rather than the raw number of observations belonging to each bin. This is what the suspended rootogram does. It now makes sense to flip the entire graph upside down, suspending the theoretical distribution from the line y=0, because this way positive residuals represent too many observations in a bin, and negative residuals too few observations in a bin. We can optionally suppress the display of the theoretical distribution, focussing entirly on the residuals.
Coefficients of the theoretical distribution
The parameters of the theoretical can either be estimated or specified by the user using the par() option. For many distributions there are two ways in which hangroot can obtain estimates, either it computes those estimates itself, this happens when the stand-alone syntax is used, or it can use the estimates obtained previously, this happens when the post-estimation syntax is used. The post-estimation syntax is available if in the table below there is an entry in the estimation command column, while the stand-alone syntax is available if the entry in the dist(name) column is not marked with an *.
In order to use the post-estimation syntax one must first estimate the model with or without covariates using one of the estimation commands in the table below. If the previous model contained covariates, then the theoretical distribution will the marginal distribution of the dependent variable implied by the model and distribution of the covariates in the data. All the estimation commands are either part of official Stata or available from ssc. The if and in qualifiers and the weights will be coppied from the last estimation command, so they may not be specified in the post-estimation syntax. This also means that this syntax is only availabe if the previous model was estimated either without weights or with fweights.
distribution estimation command dist(name) ------------------------------------------------------------------- normal / Gaussian regress normal or gaussian lognormal lognfit lognormal logistic logistic Weibull weibullfit weibull* Chi square chi2 gamma gammafit gamma Gumbel gumbelfit gumbel inverse gamma invgammafit invgamma Wald / inverse Gaussian invgaussfit wald beta betafit beta Pareto paretofit pareto Fisk / log-logistic fiskfit fisk* Dagum dagumfit dagum* Singh-Maddala smfit sm* Generalized Beta II gb2fit gb2* generalized extreme value gevfit gev* exponential exponential Laplace laplace uniform uniform geometric geometric Poisson poisson poisson zero inflated Poisson zip zip* negative binomial I nbreg nb1* negative binomial II nbreg and gnbreg nb2* zero inflated negative binomial zinb zinb* ------------------------------------------------------------------- * These distributions cannot be used in the stand-alone syntax. If the post-estimation syntax is used and the par() option is not specified, than the best fitting distribution is retrieved from the last estimation command. If the stand-alone syntax is used and the par() option is not specified, than the best fitting distribution is fitted using the method specified in the table below. If the method is not maximum likelihood, the best fitting distribution in the stand-alone syntax may differ from the post-estimation syntax.
maximum method of likelihood moments ----------------------------------------- Gaussian logistic log-normal Gumbel Wald inverse gamma Pareto beta exponential uniform Laplace geometric Poisson gamma(*) ----------------------------------------- (*) approximate
Depending on the distribution specified, hangroot will only use observations that meet the following criteria:
exponential varname >= 0 lognormal varname >= 0 Weibull varname >= 0 gamma varname >= 0 Gumbel varname >= 0 inverse gamma varname > 0 Pareto varname > 0 wald varname > 0 Fisk varname > 0 Dagum varname > 0 Singh-Maddala varname > 0 Generalized Beta II varname > 0 Poisson varname >= 0 & varname = integer zero inflated Poisson varname >= 0 & varname = integer negative binomial varname >= 0 & varname = integer zero inflated negative binomial varname >= 0 & varname = integer geometric varname >= 0 & varname = integer beta 0 < varname < 1
The geometric distribution is parameterized in terms of the number of failures before the first succes, instead of the number of trials needed to get the first succes.
General options
dist(name) specifies the theoretical distribution with which the empirical distribution is compared. This option will be ignored in the post-estimation syntax. The default is Gaussian.
The option discrete is implied when specifying the geometric or the poisson distribution.
par(numlist) specifies the parameters at which the theoretical distribution is to be fixed. If this option is not specified than the estimated parematers will be used. The table below identifies which parameter is represented by which number in the numlist. The variable of interest is represented by y, the first number in numlist is represented by a, the second by b, the third by c, and the fourth by d.
-------------------------------------------------------------------- distribution parameterization -------------------------------------------------------------------- normal / Gaussian normalden(a, b)
lognormal (1 / (y * b * sqrt(2 * pi))) * exp(-(log(y) - a)^2 / (2 * b^2))
logistic exp(-1*(y - a )/b) / (b*(1+exp(-1*(y - a )/b))^2)
Weibull (a/b)*(y/b)^(a - 1)*exp(-(y/b)^a)
Chi square gammaden(a/2,2,0,y)
gamma gammaden(a, b, 0, y)
Gumbel (1 / b) * exp(-(y - a) / b) * exp(-exp(-(y - a) / b))
inverse gamma b^a/exp(lngamma(a))*y^(-a-1)*exp(-b/y)
Wald / inverse Gaussian sqrt(b/(2*pi*y^3)) * exp(-b*(y-a)^2 / (2*a^2*y))
beta betaden(a,b,y)
Pareto b*a^b/y^(b+1)
Fisk / log-logistic a*((b/y)^a)*(1/<)/(1 + (b/y)^a)^2
Dagum (a*c)*((b/y)^a)* (1/y)/(1 + (b/y)^a)^(c+1)
Singh-Maddala (a*c/b)*((1 + (y/b)^a)^-(c+1))* ((y/b)^(a-1))
Generalized Beta II a*y^(a*c-1)*((b^(a*c))* exp(lngamma(c) + lngamma(d) - lngamma(c + d))* (1 + (y/b)^a )^(c+d))^-1
generalized extreme value 1/b*(1+c*((y-a)/b))^(-1-1/c)* exp(-1*(1+c*((y-a)/b))^(-1/c))
exponential a*exp(-a*y)
Laplace 1/(2*b)*exp(-1*|y-a|/b)
uniform 1/(b-a)
geometric (1-a)^y*a
Poisson exp(-a)*a^y/y!
zip cond(y==0, b + (1-b)*exp(-a), (1-b)*( exp(-a)*a^y/y! )
negative binomial I exp(lngamma(y + a) - lngamma(y+1) - lngamma(a)) * b^y / (1 + b)^(y+a)
negative binomial II exp(lngamma(y + 1/b) - lngamma(y + 1) - lngamma(1/b)) * (a/(1/b + a))^y * (1/(b*(1/b + a)))^(1/b)
zero inflated negative cond(y==0 , c + (1-c)*(1/(1+a*b))^(1/b), binomial (1-c) * exp(lngamma(1/b+y) - lngamma(y+1) - lngamma(1/b)) * (1/(1+a*b))^(1/b) * (1-1/(1+a*b))^y ) --------------------------------------------------------------------
suspended specifies that a suspended rootogram is drawn rather than a hanging rootogram.
notheoretical suppresses the display of the theoretical curve. This option is only allowed in combination with the suspended option.
ci specifies that confidence intervals are drawn around the bottom of the spikes or the bars. These confidence intervals assume that the number of observations in a bin follow a multinomial distribution, and use Goodman's (1965) approximation of the simultaneous confidence interval. These confidence intervals do not take into account that the parameters in the theoretical distribution are also estimated. These confidence intervals also do not take into account that nearby bins are likely to be similar, as was suggested by Vermeesch (2005). However, I would consider this latter point a feature, as this corresponds with the simple non-parametric logic that is behind the histogram and the (hanging) rootogram.
level(#) specifies the confidence level, in percent, for the confidence intervals; see level.
sims(varlist) specifies variables whose empirical distribution will be overlaid on top of the graph. The intended use is that these variables are a set of random samples from the theoretical distribution, thus providing an informal confidence interval.
spike specifies that the empirical distribution is graphed as spikes. This is the default.
bar specifies that the empirical distribution is graphed as bars.
ninter(#) specifies the number of points between bin-midpoints for which the theoretical distribution is computed. It governs the smoothness of the curve representing the theoretical distribution. This option is only allowed when hangroot is used after an estimation command with covariates. The default is 5, and can be any integer between and including 0 and 20.
mainopt(graph_options) specifies options govinging the look of the empirical distribution or the residuals are drawn. By default or when the spike option is specified, these options can be the options of twoway rspike. When the bar option is specified, these options can be the options of twoway rbar. In either case the the by(), horizontal, and vertical options are not allowed. The options that can be specified in {opt} can also be directly added as other_options.
theoropt(graph_options) specifies options govinging the look of the theoretical distribution. These options can be the options of twoway line. This option is not allowed when the notheoretical option is specified.
ciopt(graph_options) specifies options govinging the look of the confidence interval. These can be the options of:
twoway rbar when the suspended option is not specified and the empirical distribution is represented by spikes,
twoway rcap when the suspended option is not specified and the emprical distribution is represented by bars,
twoway rarea when the suspended option is specified.
simsopt(graph_options) specifies options govinging the look of the simulated distributions. These options can be the options of twoway pcspike. This option is only allowed when the sims() option is specified.
jittersims(integer) adds random noice to the vertical possition of the markers representing the simulations, where integer represents the size of the noise as a percentage of the distance between the highest and lowest marker position for the simulated variables.
jitterseed() random number seed for jittersims()
plot(plot) provides a way to add other plots to the generated graph; see: addplot_option (Stata 9 and 10) or plot_option (Stata 8).
Other options When the option bar is specified all twoway rbar options are allowed, except by(), horizontal, and vertical. Otherwise all twoway rspike options are allowed, with the same exceptions.
Options for use in the continuous case
bin(#) and width(#) are alternatives. They specify how the data are to be aggregated into bins; bin() by specifying the number of bins (from which the width can be derived) and width() by specifying the bin width (from which the number of bins can be derived).
If neither option is specified, results are the same as if bin(k) were specified, where
k = min{sqrt(N), 10*ln(N)/ln(10)}
and where N is the number of observations.
start(#) specifies the theoretical minimum of varname. The default is start(m), where m is the observed minimum value of varname.
Specify start() when you are concerned about sparse data, for instance, if you know that varname can have a value of 0, but you are concerned that 0 may not be observed.
start(#), if specified, must be less than or equal to m, or else an error will be issued.
Options for use in the discrete case
discrete specifies that varname is discrete and that you want each unique value of varname to have its own bin (bar of histogram).
width(#) is rarely specified in the discrete case; it specifies the width of the bins. The default is width(d), where d is the observed minimum difference between the unique values of varname.
Specify width() if you are concerned that your data are sparse. For example, in theory varname could take on the values, say, 1, 2, 3, ..., 9, but because of the sparseness, perhaps only the values 2, 4, 7, and 8 are observed. Here the default width calculation would produce width(2) and you would want to specify width(1).
start(#) is also rarely specified in the discrete case; it specifies the theoretical minimum value of varname. The default is start(m), where m is the observed minimum value.
As with width(), you specify start(#) if you are concerned that your data are sparse. In the previous example, you might also want to specify start(1). start() does nothing more than add white space to the left side of the graph.
The value of # in start() must be less than or equal to m, or an error will be issued.
Examples
The residuals after a linear regression regress should be normally distributed, but in this case it appears to follow a bimodal distribution.
sysuse nlsw88, clear gen ln_w = ln(wage) reg ln_w grade age ttl_exp tenure predict resid, resid hangroot resid (click to run)
This bimodal distribution appears to be the result of the ommision of the variable union.
sysuse nlsw88, clear gen ln_w = ln(wage) reg ln_w grade age ttl_exp tenure union predict resid2, resid hangroot resid2 (click to run)
The part of the graph that tells us how wel the distribution fits to the data is the distance between the bottom of the spikes and the horizontal line y=0. So why not explicitly plot these residuals instead? When we do that, it would also make sense to flip the entire graph upside down: In that case bins with too many cases will receive positive residuals and bins with too few cases negative residuals. This is done with the susp option
hangroot resid2, ci susp theoropt(lpattern(-)) (click to run)
One can focuss more on the residuals by removing the theoretical distribution.
hangroot resid2, ci susp notheor (click to run)
hangroot can be used as a post-estimation command. In this case a log normal distribution without covariates was fitted using lognfit, which is available from ssc.
sysuse nlsw88, clear lognfit wage hangroot, ci (click to run)
A hanging rootogram can also be used to compare the distribution of two a variable across two groups. In the example below the wage distribution of those with a college degree is the reference/"theoretical" distribution, and the wage distribution of the respondents without a college degree hangs from it.
sysuse nlsw88, clear hangroot wage, dist(theoretical collgrad) (click to run)
hangroot can also be used to compare an empirical distribution with the marginal distribution implied by a regression model. In this case we create data that is appropriate for linear regression, but the distribution of y looks nothing like a normal distribution.
In Stata >= 10 I would have used rnormal() instead of invnormal(uniform()), but I have not done so here since hangroot is also supposed to work in Stata 9.2.
drop _all set obs 1000 gen byte x = _n <= 250 gen y = -3 + 3*x + invnormal(uniform()) hangroot y, dist(normal) (click to run)
However, the distribution of y fits the marginal distribution implied by the regression model. In this case we could also have inspected the residuals, which should be normally distributed. However, looking at residuals won't work for models that imply other distribution, e.g. Poisson or beta regression, but in those cases one can still inspect the marginal distribution.
reg y x hangroot (click to run)
Some deviation from the theoretical distribution is expected, as the data are typically random draws from a larger population. A nice way to see what kind of variability is still consistent with the model is to create a couple variables that are random draws assuming that the model is correct, and overlay the distribution of these random draws on top of the hanging rootogram.
In Stata >= 10 I would have used rnormal(mu,sd) instead of invnormal(uniform())*sd + mu, but I have not done so here since hangroot is also supposed to work in Stata 9.2.
predict double mu , xb scalar sd = e(rmse) forvalues i = 1/20 { gen sim`i' = invnormal(uniform())*sd + mu } hangroot, sims(sim*) jitter(5) xlab(-6(3)3) (click to run)
Author
Maarten L. Buis Universitaet Tuebingen Institut fuer Soziologie maarten.buis@uni-tuebingen.de
Acknowledgement
Several programming tricks from dpplot by Nick Cox are incorporated in this program.
References
Friendly, M. 2000. Visualizing categorical data. Cary, NC: SAS Institute.
Goodman, L.A. 1965, On Simultaneous Confidence Intervals for Multinomial Proportions. Technometrics, 7(2), pp. 247-254.
Tukey, J.W. 1965. The future of processes of data analysis. Reprinted in Jones, L.V. (ed.) 1986. The collected works of John W. Tukey. Volume IV: Philosophy and principles of data analysis: 1965-1986. Monterey, CA: Wadsworth and Brooks/Cole, 517-547.
Tukey, J.W. and Wilk, M.B. 1965. Data analysis and statistics: principles and practice. Reprinted in Cleveland, W.S. (ed.) 1988. The collected works of John W. Tukey. Volume V: Graphics: 1965-1985. Pacific Grove, CA: Wadsworth and Brooks/Cole, 23-29.
Tukey, J.W. 1972. Some graphic and semigraphic displays. In Bancroft, T.A. and Brown, S.A. (eds) Statistical papers in honor of George W. Snedecor. Ames, IA: Iowa State University Press, 293-316.
Tukey, J.W. 1977, Exploratory Data Analysis, Addison-Wesley.
Vermeesch, P. 2005, Statistical uncertainty associated with histograms in the Earth Sciences, Journal of Geophysical Research - Solid Earth, Vol 110, B02211.
Wainer, H. 1974, The Suspended Rootogram and Other Visual Displays: An Empirical Validation. The American Statistician, 28(4), pp. 143-145.
Also see:
Estimation commands: If installed: lognfit, weibullfit, gammafit, gumbelfit, invgammafit, invgaussfit, betafit, paretofit, fiskfit, dagumfit, smfit, gb2fit gevfit
Alternatives: Online: spikeplot, histogram, qnorm, pnorm, pchi, and qchi.
if installed: dpplot, pbeta, qbeta, pweibull, qweibull, plogn, qlogn, pgamma, qgamma, pgumbel, qgumbel, pinvgauss, qinvgauss, pinvgamma,