help for margdistfit
-------------------------------------------------------------------------------

Title

margdistfit -- Post-estimation command that compares the observed and theoretical marginal distributions.

Syntax

margdistfit, [ {pp | qq | cumul | hangroot[(hangroot_options)]} sims(#) noparsamp obsopts(scatter_options) refopts(line_options) simopts(line_options) nosquare e(#) ]

Description

margdistfit is a post-estimation command for checking how well the distributional assumptions of a regression model fit the data. It does so by comparing the marginal distribution implied by the regression model to the distribution of the dependent variable. This comparison is done through a probability-probability plot, a quantile-quantile plot, a hanging rootogram, or a plot of the two cumulative distribution functions.

The key concept in this command is the marginal distribution. Regression models assume a distribution for the dependent variable, and this distribution can be described in terms of a small number of parameters: e.g. the mean and the standard deviation in the case of the normal/Gaussian distribution. One or more of these distribution parameters, typically the mean, is allowed to differ from observation to observation depending on the values of the explanatory variables. As a consequence, the distribution of the explained variable implied by the model is a mixture distribution in which each observation has its own parameters. This is the marginal distribution.
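The idea of a marginal distribution can be made concrete with a minimal sketch for a linear regression. It is not a description of how margdistfit computes things internally: the cutoff of 6000, the variable names xb and F_i, and the use of e(rmse) as the estimate of the standard deviation are all assumptions made for illustration.

```stata
sysuse auto, clear
regress price mpg foreign

// linear predictor for each observation, and an estimate of sigma
// (assumption: e(rmse) is used here; margdistfit may use a different one)
predict double xb, xb
scalar sigma = e(rmse)

// each observation has its own normal distribution N(xb_i, sigma);
// the marginal CDF at y0 is the average of these n observation-specific CDFs
local y0 = 6000
generate double F_i = normal((`y0' - xb)/scalar(sigma))
summarize F_i, meanonly
display "theoretical marginal Pr(price <= `y0') = " %5.3f r(mean)

// observed counterpart: the proportion of observations at or below y0
count if price <= `y0'
display "observed Pr(price <= `y0')             = " %5.3f r(N)/_N
```

Repeating this comparison over a grid of values of y0 is, in essence, what a plot of the two cumulative distribution functions shows.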

To give an indication of how much deviation from the theoretical distribution is still legitimate, the graph also shows the distribution of several (by default 20) simulated variables under the assumption that the regression model is true. By default, the simulations reflect two sources of uncertainty: uncertainty about the parameter estimates, and the randomness inherent in drawing from a distribution. This is achieved by creating each simulated variable in two steps: first the parameters are drawn from their sampling distribution, and then the simulated variable is drawn given those parameters.
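The two steps can be sketched by hand for a linear regression. This is a simplified illustration under stated assumptions, not margdistfit's own code: the coefficients are drawn from a multivariate normal with mean e(b) and variance e(V), the standard deviation is held fixed at e(rmse) (so its sampling uncertainty is ignored), and the names bsim, xbsim, and ysim are made up for this example.

```stata
sysuse auto, clear
regress price mpg foreign
scalar sigma = e(rmse)
set seed 12345

// step 1: draw the coefficients from their approximate sampling
// distribution N(b, V); bsim starts as a copy of e(b) so that it
// keeps the column names that -matrix score- needs
matrix bsim = e(b)
mata: b = st_matrix("e(b)")
mata: L = cholesky(st_matrix("e(V)"))
mata: st_replacematrix("bsim", b + rnormal(1, cols(b), 0, 1)*L')

// step 2: draw the simulated dependent variable given those coefficients
matrix score double xbsim = bsim
generate double ysim = xbsim + scalar(sigma)*rnormal()
```

Skipping step 1 and using e(b) directly in step 2 corresponds to the idea behind the noparsamp option.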

margdistfit may be used after estimating a model with regress, poisson, zip, nbreg, gnbreg, zinb, or betafit (the latter is available from ssc).

Options

pp specifies that a probability-probability plot is to be displayed. This graph is best for comparing the theoretical and observed distributions in the middle of the distribution. It may not be combined with qq, cumul, or hangroot.

qq specifies that a quantile-quantile plot is to be displayed. It may not be combined with pp, cumul, or hangroot.

cumul specifies that the observed and theoretical cumulative distribution functions are to be graphed. It may not be combined with pp, qq, or hangroot.

hangroot[(hangroot_options)] specifies that a hanging rootogram is used to compare the observed and theoretical distributions. This requires that the hangroot package, which is available from ssc, is installed. It may not be combined with pp, qq, or cumul.

sims(#) specifies the number of simulated variables. The default is 20.

noparsamp specifies that the simulated variables are drawn from the distribution with its parameters fixed at the point estimates of the model, rather than first drawing the parameters from their sampling distribution.

obsopts(scatter_options) specifies options governing how the distribution of the observed variable looks.

refopts(line_options) specifies options governing how the reference line looks.

simopts(line_options) specifies options governing how the distributions of the simulated variables look.

nosquare specifies that the graph is not forced to be square. By default, the probability-probability and quantile-quantile plots are forced to be square, as a perfect fit is represented by the 45 degree line. Forcing the graph to be square ensures that the 45 degree line truly has an angle of 45 degrees. This option is not allowed in combination with cumul or hangroot.

e(#) specifies the maximal error used when approximating the quantile function or the cumulative distribution function. The quantile function is computed using the algorithm discussed in Hoermann and Leydold (2003). A similar algorithm is used to compute the cumulative distribution function. The latter approximation is, strictly speaking, not necessary, but it significantly speeds up the computation in medium to large datasets. With pp or cumul it may be a number between 0 and 1e-3. The cumulative distribution function will be computed directly instead of approximated when a number less than 1e-12 is specified. With

Examples

A well fitting model:

    sysuse nlsw88, clear
    gen lnw = ln(wage)
    reg lnw grade ttl_exp tenure union
    margdistfit, qq

A not so well fitting model. Note that linear regression is typically quite robust against deviations from the normality assumption. However, knowing that such deviations exist in your data and substantively understanding why they are there can add a lot of "flesh" to the "bare bones" of your model.

    sysuse auto, clear
    reg price mpg foreign
    margdistfit, pp

An example created to illustrate that the marginal distribution can look very different from what one may expect. I use regress, so I assume a normal distribution where the mean can change from observation to observation depending on the value of x. In this case the data were created such that we should see a distribution of y that consists of two humps, one at -2 and the other at 2, which is indeed the case.

    preserve
    set seed 12345
    drop _all
    set obs 500
    gen x = runiform() < .5
    gen y = -2 + 4*x + rnormal()
    regress y x
    margdistfit, hangroot(jitter(5))
    restore

An example that can be used to compare the fit of several count models.

The strange pattern in the last graph is due to the large sampling variability in the inflation parameter: by default, the parameters for each simulation are drawn from the sampling distribution. As a result, some of the samples are drawn from a distribution where the probability of a degenerate zero is 1 (that is, the distribution reduces to a spike at 0), while for the other samples that probability is 0 (that is, the distribution reduces to a negative binomial). In essence, this means that the zinb model is not appropriate for this data.

    preserve
    use http://www.stata-press.com/data/lf2/couart2, clear
    mkspline ment1 20 ment2 = ment

    // this is just to ensure that graph names do not conflict
    // with any graph name you have open
    tempname poisson zip nb zinb

    poisson art fem mar kid5 phd ment1 ment2
    margdistfit, hangroot(susp notheor jitter(2)) title(poisson) name(`poisson')

    zip art fem mar kid5 phd ment1 ment2, inflate(_cons)
    margdistfit, hangroot(susp notheor jitter(2)) title(zip) name(`zip')

    nbreg art fem mar kid5 phd ment1 ment2
    margdistfit, hangroot(susp notheor jitter(2)) title(nbreg) name(`nb')

    zinb art fem mar kid5 phd ment1 ment2, inflate(_cons)
    margdistfit, hangroot(susp notheor jitter(2)) title(zinb) name(`zinb')
    restore
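Since the pattern in the zinb graph is attributed to sampling variability in the inflation parameter, one way to explore this is to redraw that graph with the noparsamp option, so that every simulation uses the point estimates. The following is only a suggested variation on the example above; the graph title is made up for illustration.

```stata
preserve
use http://www.stata-press.com/data/lf2/couart2, clear
mkspline ment1 20 ment2 = ment

// same zinb model as above, but the simulated variables are drawn
// at the point estimates instead of at sampled parameter values
zinb art fem mar kid5 phd ment1 ment2, inflate(_cons)
margdistfit, hangroot(susp notheor jitter(2)) noparsamp ///
    title("zinb without parameter sampling")
restore
```

Comparing this graph with the default one gives an impression of how much of the spread in the simulations comes from parameter uncertainty.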

Author

Maarten L. Buis
Universitaet Tuebingen
Institut fuer Soziologie
maarten.buis@uni-tuebingen.de

References

Hoermann, Wolfgang and Leydold, Josef. (2003). Continuous random variate generation by fast numerical inversion. ACM Transactions on Modeling and Computer Simulation, 13(4): 347--362.

Acknowledgement

Garry Anderson, David Ashcraft, Ronan Conroy, Nick Cox and Austin Nichols (in alphabetical order) made several useful comments.

Also see

Online: pnorm, qnorm

If installed: hangroot, qplot, pbeta, qbeta