Fitting a zero one inflated beta distribution by maximum likelihood
zoib depvar [indepvars] [weight] [if] [in] [, oneinflate(varlist_o) zeroinflate(varlist_z) nozero noone phivar(varlist_p) robust cluster(clustervar) level(#) maximize_options ]
by ... : may be used with zoib; see help by.
fweights, pweights, and aweights are allowed; see help weights.
When using Stata version 11 or higher, indepvars, oneinflate(), zeroinflate(), and phivar() may contain factor variables; see fvvarlist.
zoib fits by maximum likelihood a zero one inflated beta distribution to the distribution of a variable depvar. depvar ranges between 0 and 1: for example, it may be a proportion. zoib estimates the probabilities of having the value 0 and/or 1 as separate processes. The logic is that we can often think of proportions of 0 or 1 as being qualitatively different from, and generated through a different process than, the other proportions.
The zero one inflated beta distribution consists of three parts:
    a probability that depvar = 0
    a probability that depvar = 1
    the distribution of depvar given that 0 < depvar < 1
This means that the likelihood is:
    [1 - Pr(depvar = 0)] * [1 - Pr(depvar = 1)] * Beta(depvar | mu, phi)    if 0 < depvar < 1
    Pr(depvar = 0)                                                          if depvar = 0
    Pr(depvar = 1)                                                          if depvar = 1
The zero inflation and one inflation parts of this model are by default included whenever the dependent variable contains the value 0 and 1 respectively, and excluded otherwise. The user can force the exclusion of these parts by specifying the nozero and noone options.
The effects on the log odds of having the value 0 or 1 on the variable depvar are represented in the zeroinflate and oneinflate equations respectively. The remaining proportions are modelled using a beta distribution with the parameterization discussed in, for example, Ferrari and Cribari-Neto (2004), Paolino (2001), or Smithson and Verkuilen (2006). These effects are also reported on the logit scale.
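As a rough illustration of how these pieces combine, the sketch below evaluates by hand the likelihood contribution of a single observation with 0 < depvar < 1, following the expression given in this help file. All numeric values are hypothetical. The mapping from mu and phi to the two beta shape parameters (a = mu*phi, b = (1 - mu)*phi) is an assumption based on the mean/scale parameterization in Ferrari and Cribari-Neto (2004); it is not taken from zoib's source code.

```stata
* All numbers below are made up for illustration only
scalar xb0 = -2.2      // linear predictor of the zeroinflate equation (logit scale)
scalar xb1 = -1.4      // linear predictor of the oneinflate equation (logit scale)
scalar xbm =  0.4      // linear predictor of the proportion equation (logit scale)
scalar phi =  5        // scale parameter of the beta part

* Back-transform the logit-scale predictors to probabilities and a mean
scalar p0 = invlogit(xb0)      // Pr(depvar = 0)
scalar p1 = invlogit(xb1)      // Pr(depvar = 1)
scalar mu = invlogit(xbm)      // mean of the beta part

* Assumed mapping from (mu, phi) to the beta shape parameters
scalar a = mu*phi
scalar b = (1 - mu)*phi

* Likelihood contribution of an observation with depvar = 0.5
display (1 - p0)*(1 - p1)*betaden(a, b, 0.5)
```

invlogit() and betaden() are built-in Stata functions. The sketch does not reproduce how zoib parameterizes phi internally.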
An alternative to zoib is to assume that proportions of 0 or 1 arise because rare events have not had the time to produce a single realization, so that the 0s and 1s are created via the same process as all the other proportions. In this case one can use a fractional logit model as proposed by Papke and Wooldridge (1996), which can be estimated using glm; see http://www.stata.com/support/faqs/stat/logit.html.
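A minimal sketch of that alternative, assuming the same dataset and variables as the example in this help file (k401.dta with prate, mrate, totemp, age, and sole), might look like this:

```stata
* Fractional logit: binomial family with logit link and robust standard errors
* (dataset and variable names are borrowed from the zoib example in this file)
use k401.dta, clear
replace totemp = totemp/100
glm prate mrate totemp age sole, family(binomial) link(logit) robust
```

Here the 0s and 1s in prate are treated as ordinary values of the proportion, so there is no separate inflation equation. The robust option is used because the binomial variance assumption need not hold for a fractional outcome.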
zeroinflate() specifies the variables that influence the log odds of having the value 0 on depvar. This option can only be specified if the value 0 exists in depvar.
oneinflate() specifies the variables that influence the log odds of having the value 1 on depvar. This option can only be specified if the value 1 exists in depvar.
nozero specifies that no zero inflation equation is to be estimated. This implies that all observations with the value 0 on depvar will be ignored.
noone specifies that no one inflation equation is to be estimated. This implies that all observations with the value 1 on depvar will be ignored.
phivar() allows the user to specify the scale parameter for the beta part of the zero one inflated beta distribution as a function of the covariates specified in the variable list. A constant term is always included in each equation.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.14 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters).
cluster(clustervar) specifies that the observations are independent across groups (clusters) but not necessarily within groups. clustervar specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. See [U] 23.14 Obtaining robust variance estimates. Specifying cluster() implies robust.
level(#) specifies the confidence level, in percent, for the confidence intervals of the coefficients; see help level.
nolog suppresses the iteration log.
maximize_options control the maximization process; see help maximize. If you are seeing many "(not concave)" messages in the log, using the difficult option may help convergence.
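To show how several of these options can be combined, here is a hedged sketch that reuses the variables from the example in this help file. The choice of totemp as the covariate for the scale parameter is purely illustrative, and whether difficult is needed depends on the data.

```stata
* Illustrative only: let the scale parameter depend on totemp, request
* robust standard errors, and use the difficult maximization option
use k401.dta, clear
replace totemp = totemp/100
zoib prate mrate totemp age sole,          ///
     oneinflate(mrate totemp age sole)     ///
     phivar(totemp) robust difficult nolog
```

No zeroinflate() is given here, mirroring the original example; per the option description, it can only be specified when depvar contains the value 0.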
use k401.dta, clear
replace totemp = totemp/100
zoib prate mrate totemp age sole, ///
     oneinflate(mrate totemp age sole)
Maarten L. Buis, WZB email@example.com
Cook, D.O., Kieschnick, R. and McCullough, B.D. 2008. Regression analysis of proportions in finance with self selection. Journal of Empirical Finance 15(5): 860-867.
Evans, M., Hastings, N. and Peacock, B. 2000. Statistical distributions. New York: John Wiley.
Ferrari, S.L.P. and Cribari-Neto, F. 2004. Beta regression for modelling rates and proportions. Journal of Applied Statistics 31(7): 799-815.
Johnson, N.L., Kotz, S. and Balakrishnan, N. 1995. Continuous univariate distributions: Volume 2. New York: John Wiley.
MacKay, D.J.C. 2003. Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press (see p.316). http://www.inference.phy.cam.ac.uk/itprnn/book.pdf
Papke, L.E. and Wooldridge, J.M. 1996. Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11(6): 619-632.
Paolino, P. 2001. Maximum likelihood estimation of models with beta-distributed dependent variables. Political Analysis 9(4): 325-346. http://polmeth.wustl.edu/polanalysis/vol/9/WV008-Paolino.pdf
Smithson, M. and Verkuilen, J. 2006. A better lemon squeezer? Maximum likelihood regression with beta-distributed dependent variables. Psychological Methods 11(1): 54-71.
Jeroan Allison helpfully identified a bug in a previous version of the predict function of zoib.
Online: help for zoib postestimation, glm