```
help for margdistfit
-------------------------------------------------------------------------------

Title

margdistfit -- Post-estimation command that compares the observed and
theoretical marginal distributions.

Syntax

margdistfit , [ { pp | qq | cumul | hangroot[(hangroot_options)] }
sims(#) noparsamp obsopts(scatter_options)
refopts(line_options) simopts(line_options) nosquare e(#) ]

Description

margdistfit is a post-estimation command for checking how well the
distributional assumptions of a regression model fit the data. It does
so by comparing the marginal distribution implied by the regression model
to the distribution of the dependent variable. This comparison is done
through either a probability-probability plot, a quantile-quantile plot, a
hanging rootogram, or a plot of the two cumulative distribution
functions.

The key concept in this command is the marginal distribution. Regression
models assume a distribution for the dependent variable, and this
distribution can be described in terms of a small number of parameters:
e.g. the mean and the standard deviation in case of the normal/Gaussian
distribution. One or more of these distribution parameters, typically the
mean, is allowed to differ from observation to observation depending on
the values of the explanatory variables. As a consequence, the
distribution of the explained variable implied by the model is a mixture
distribution such that each observation has its own parameters. This is
the marginal distribution.

To give an indication of how much deviation from the theoretical
distribution is still consistent with the model, the graph also shows the
distribution of several (by default 20) simulated variables under the
assumption that the regression model is true. By default, the simulations
reflect two sources of uncertainty: uncertainty about the parameter
estimates, and the fact that the simulated values are random draws from a
distribution. This is achieved by creating the simulated variables in two
steps: first the parameters are drawn from their sampling distribution,
and then the simulated variable is drawn given those parameters.

margdistfit may be used after estimating a model with regress, poisson,
zip, nbreg, gnbreg, zinb, or betafit (the latter is available from ssc).

Options

pp specifies that a probability-probability plot is to be displayed. This
graph is best for looking at the comparison of the theoretical and
observed distribution in the middle of the distribution. It may not
be combined with qq, cumul, or hangroot.

qq specifies that a quantile-quantile plot is to be displayed. This graph
is best for looking at the comparison of the theoretical and observed
distribution in the tails of the distribution. This is the default.
It may not be combined with pp, cumul, or hangroot.

cumul specifies that the observed and theoretical cumulative distribution
functions are to be graphed. It may not be combined with pp, qq,
or hangroot.

hangroot[(hangroot_options)] specifies that a hanging rootogram is used
to compare the observed and theoretical distributions. This requires
that the hangroot package is installed, which is available from ssc.
It may not be combined with pp, qq, or cumul.

sims(#) specifies the number of simulated variables; the default is 20.

noparsamp specifies that the simulated variables are drawn from the
distribution with parameters fixed at the point estimates of the
model, rather than first drawing the parameters from their sampling
distribution.

obsopts(scatter_options) specifies options governing how the distribution
of the observed variable looks.

refopts(line_options) specifies options governing how the reference line
looks.

simopts(line_options) specifies options governing how the distributions of
the simulated variables look.

nosquare specifies that the graph is not forced to be square. By default
the probability-probability and quantile-quantile plots are forced to
be square, as a perfect fit is represented by the 45 degree line. By
forcing the graph to be square the 45 degree line truly has an angle
of 45 degrees. This option is not allowed in combination with cumul
or hangroot.

e(#) specifies the maximal error used when approximating the quantile
function or cumulative distribution function. The quantile function is
computed using the algorithm discussed in Hoermann and Leydold (2003).
A similar algorithm is used to compute the cumulative distribution
function. The latter is strictly speaking not necessary, but it
significantly speeds up the computation in medium to large datasets.
With pp or cumul it may be a number between 0 and 1e-3; the
cumulative distribution function is computed directly instead of
approximated when a number less than 1e-12 is specified. With qq it
may be a number between 1e-12 and 1e-3. The default is
min(1e-6,10^-ceil(log10(N))), where N is the sample size.

Examples

A well fitting model:

sysuse nlsw88, clear
gen lnw = ln(wage)
reg lnw grade ttl_exp tenure union
margdistfit, qq

A not so well fitting model. Note that linear regression is typically
quite robust against deviations from the distributional assumption.
However, knowing that such deviations exist in your data and substantively
understanding why they are there can add a lot of "flesh" to the "bare
bones" of your model.

sysuse auto, clear
reg price mpg foreign
margdistfit, pp

An example created to illustrate that the marginal distribution can look
very different from what one may expect. I use regress, so I assume a
normal distribution where the mean can change from observation to
observation depending on the value of x. In this case the data were
created such that we should see a distribution of y that consists of
two humps, one at -2 and the other at 2, which is indeed the case.

preserve
set seed 12345
drop _all
set obs 500
gen x = runiform() < .5
gen y = -2 + 4*x + rnormal()
regress y x
margdistfit, hangroot(jitter(5))
restore

An example that can be used to compare the fit of several count models.

The strange pattern in the last graph is due to the large sampling
variability in the inflation parameter. By default the parameters are
drawn from their sampling distribution for each simulation. As a result,
some of the samples are drawn from a distribution where the probability
of a degenerate zero is 1 (that is, the distribution reduces to a spike
at 0), while for the other samples that probability is 0 (that is, the
distribution reduces to a negative binomial). This means that in essence
the zinb model is not appropriate for this data.

preserve
use http://www.stata-press.com/data/lf2/couart2, clear
mkspline ment1 20 ment2 = ment

// this is just to ensure that graph names do not conflict
// with any graph name you have open
tempname poisson zip nb zinb

poisson art fem mar kid5 phd ment1 ment2
margdistfit, hangroot(susp notheor jitter(2)) title(poisson) name(`poisson')

zip art fem mar kid5 phd ment1 ment2, inflate(_cons)
margdistfit, hangroot(susp notheor jitter(2)) title(zip) name(`zip')

nbreg art fem mar kid5 phd ment1 ment2
margdistfit, hangroot(susp notheor jitter(2)) title(nbreg) name(`nb')

zinb art fem mar kid5 phd ment1 ment2, inflate(_cons)
margdistfit, hangroot(susp notheor jitter(2)) title(zinb) name(`zinb')

restore

Author

Maarten L. Buis
Universitaet Tuebingen
Institut fuer Soziologie
maarten.buis@uni-tuebingen.de

References

Hoermann, Wolfgang and Leydold, Josef. (2003). Continuous random variate
generation by fast numerical inversion. ACM Transactions on Modeling
and Computer Simulation, 13(4): 347--362.

Acknowledgement

Garry Anderson, David Ashcraft, Ronan Conroy, Nick Cox and Austin Nichols
(in alphabetical order) made several useful comments.

Also see

Online: pnorm, qnorm

If installed: hangroot, qplot, pbeta, qbeta
```
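
The Description characterizes the marginal distribution as a mixture in which each observation has its own parameters. One way to write that down is sketched below; it is an illustration of the concept, not a statement of how margdistfit computes it internally. The notation is an assumption introduced here and does not appear in the help text: N is the size of the estimation sample, F(. | theta) is the cumulative distribution function of the assumed family, theta-hat_i are the fitted parameters for observation i, and each observation is assumed to receive equal weight 1/N. The second line specializes the first to a model fitted with regress, where only the mean varies across observations.

```latex
% Sketch under the assumptions stated above (equal weights, fitted parameters)
\begin{align*}
F_{\mathrm{marg}}(y)
  &= \frac{1}{N} \sum_{i=1}^{N} F\left(y \mid \hat{\theta}_i\right) \\
% Normal linear regression: mean x_i'\hat{\beta}, common standard deviation \hat{\sigma}
F_{\mathrm{marg}}(y)
  &= \frac{1}{N} \sum_{i=1}^{N}
     \Phi\left(\frac{y - x_i'\hat{\beta}}{\hat{\sigma}}\right)
\end{align*}
```

With a single binary explanatory variable, as in the two-hump example, the sum collapses to a mixture of two normal distributions centered at the two fitted means, weighted by the shares of observations with x = 0 and x = 1.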
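
The default for e(#) given under Options, min(1e-6,10^-ceil(log10(N))), may be easier to read with a few values worked out. The sample sizes below are purely illustrative and are not taken from any dataset in the examples.

```latex
% Worked values of the default maximal error for illustrative sample sizes N
\begin{align*}
N = 500:       \quad & \lceil \log_{10} N \rceil = 3, \quad \min\left(10^{-6},\, 10^{-3}\right) = 10^{-6} \\
N = 1\,000\,000: \quad & \lceil \log_{10} N \rceil = 6, \quad \min\left(10^{-6},\, 10^{-6}\right) = 10^{-6} \\
N = 5\,000\,000: \quad & \lceil \log_{10} N \rceil = 7, \quad \min\left(10^{-6},\, 10^{-7}\right) = 10^{-7}
\end{align*}
```

Read this way, the default stays at 1e-6 for any sample of up to one million observations and only becomes stricter for larger samples.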