Title
oaxaca -- Blinder-Oaxaca decomposition of outcome differentials
Syntax
oaxaca depvar [indepvars] [if] [in] [weight] , by(groupvar) [ options ]
where indepvars is term [term ...]
with term as varlist or ([name:] varlist)
and varlist may contain normalize(spec)
options                     Description
-------------------------------------------------------------------------
Main
  by(groupvar)              specifies the groups; by() is required
  swap                      swap groups
  linear                    linear decomposition; the default
  logit                     logit decomposition
  probit                    probit decomposition
  nodetail                  suppress detailed decomposition
  adjust(varlist)           adjustment for selection variables

Decomposition type
  threefold[(reverse)]      three-fold decomposition; the default
  weight(# [# ...])         two-fold decomposition using specified weights
  pooled[(model_opts)]      two-fold decomposition using pooled model including groupvar
  omega[(model_opts)]       two-fold decomposition using pooled model excluding groupvar
  reference(name)           two-fold decomposition using stored model
  split                     split unexplained part of two-fold decomposition

SE/SVY
  svy[(svyspec)]            survey data estimation
  vce(vcetype)              vcetype may be analytic, robust, cluster clustvar, bootstrap, or jackknife
  cluster(varname)          adjust standard errors for intragroup correlation (Stata 9)
  fixed[(varlist)]          assume non-stochastic regressors
  suest[(name)] | nosuest   do/do not use suest to obtain joint variance matrix
  nose                      suppress computation of standard errors

Models
  model1(model_opts)        estimation details for the Group 1 model
  model2(model_opts)        estimation details for the Group 2 model
  noisily                   display model estimation output
  relax                     do not stop on dropped coefficients/zero variances
  estopts                   options passed through to all models

X-Values (linear decomposition only)
  x1(names_and_values)      provide custom X-values for Group 1
  x2(names_and_values)      provide custom X-values for Group 2

Reporting
  xb                        display table with coefficients and means
  level(#)                  set confidence level; default is level(95)
  eform                     report exponentiated results
  nolegend                  suppress legend
-------------------------------------------------------------------------
bootstrap, by, jackknife, statsby, and xi are allowed; see prefix.
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
vce(), cluster(), and weights are not allowed with the svy option.
fweights, aweights, pweights, and iweights are allowed; see weight. aweights are not allowed with logit or probit.
Description
oaxaca computes the so-called Blinder-Oaxaca decomposition, which is often used to analyze wage gaps by sex or race. depvar is the outcome variable of interest (e.g. log wages) and indepvars are predictors (e.g. education, work experience, etc.). groupvar identifies the groups to be compared. The standard errors of the decomposition components are computed using the delta method and take into account the variability induced by stochastic regressors. For methods and formulas see Jann (2008).
oaxaca also supports the non-linear decomposition for binary dependent variables proposed by Yun (2004). See the logit and probit options. An alternative non-linear decomposition for binary dependent variables, suggested by Fairlie (2005), is available as fairlie from the SSC Archive (see ssc describe fairlie).
oaxaca typed without arguments replays the last results, optionally applying xb, level(), eform, or nolegend.
Subsume results for sets of variables
Decomposition results can be aggregated for subsets of variables using the syntax
... ([name:] varlist) ...
where name provides a label for the subset (the name of the first variable in the subset is used as label if name is omitted). For example, you could type
. oaxaca lnwage educ (expten: exper tenure), by(female)
to subsume the contributions of exper and tenure. In addition to variable names, _cons and _offset can be specified as part of a subset.
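The aggregation described above amounts to summing detailed contributions within each named set. The following Python sketch illustrates the idea only; it is not oaxaca's code, and the contribution values and the helper name subsume are hypothetical.

```python
# Hypothetical detailed contributions, one per regressor (illustrative values only).
detailed = {"educ": 0.05, "exper": 0.02, "tenure": 0.01}

# Sets as in "educ (expten: exper tenure)": a lone variable forms its own set.
sets = {"educ": ["educ"], "expten": ["exper", "tenure"]}


def subsume(detailed, sets):
    """Sum the detailed contributions within each named set."""
    return {name: sum(detailed[v] for v in members)
            for name, members in sets.items()}


aggregated = subsume(detailed, sets)
```

The total over all sets equals the total over all regressors, so subsuming changes only how the detailed results are grouped, not the overall decomposition.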
Normalization of categorical variables
For categorical regressors, the detailed decomposition results depend on the choice of the (omitted) base category. A solution is to compute the decomposition based on "normalized" effects, i.e. effects that are expressed as deviation contrasts from the grand mean (Yun 2005). To "normalize" the effects for a set of indicator variables representing a categorical variable, include the indicator variables in the list of regressors using the syntax
... normalize(spec) ...
where spec is usually just the list of indicator variables. Note that an indicator variable has to be supplied for every category (including the base category). For example, you could type
. tabulate isco, generate(isco) nofreq
. oaxaca lnwage educ exper normalize(isco1-isco9), by(female)
The tabulate, generate() command is a convenient way to generate a set of indicator variables from a categorical variable (such as the 9 major groups of the ISCO-88 job classification). The base category to be omitted from model estimation can be designated using the b. operator, but this should not affect the decomposition results. For example, you could type
... normalize(married b.single divorced) ...
The first variable is taken as the base category if no base category is marked.
Note that normalize() is allowed within subsumed variable sets. For example, you could type
... (family: kids6 normalize(married b.single divorced)) ...
Normalization can also be applied to interactions between a categorical variable and a continuous variable. In this case, type # followed by the name of the continuous variable at the end of the list in normalize(). Because you would usually also want to normalize the main effects, you should supply two normalize() statements, one for the main effects and one for the interactions. Example: Suppose d1, d2, and d3 are indicator variables representing a categorical variable and d1x, d2x, and d3x are interactions of these indicators with a continuous variable x. You could then type
... normalize(d1 d2 d3) normalize(d1x d2x d3x # x) ...
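The deviation-contrast idea behind the normalization of main effects can be sketched in a few lines of Python. This is a simplified illustration under the assumption that the base category enters with a coefficient of zero; the function name normalize_effects and all numbers are hypothetical, and the sketch does not cover the interaction case.

```python
def normalize_effects(intercept, coefs):
    """Re-express category effects as deviations from their mean.

    coefs maps every category indicator to its coefficient; the omitted
    base category enters with coefficient 0. The mean effect is moved
    into the intercept, so predictions are unchanged.
    """
    mean = sum(coefs.values()) / len(coefs)
    normalized = {name: b - mean for name, b in coefs.items()}
    return intercept + mean, normalized


# Illustrative model with "single" as the omitted base category.
cons_a, eff_a = normalize_effects(
    2.0, {"married": 0.3, "single": 0.0, "divorced": -0.1})

# The same model re-parameterized with "married" as the base category.
cons_b, eff_b = normalize_effects(
    2.3, {"married": 0.0, "single": -0.3, "divorced": -0.4})
```

Both parameterizations yield the same normalized intercept and effects, which is why the choice of base category should not affect the decomposition results.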
Options
----+ Main +-------------------------------------------------------------
by(groupvar) specifies the groupvar that defines the two groups that are to be compared. by() is required.
swap reverses the order of the groups.
linear causes the standard linear decomposition to be computed. This is the default. The estimation command for the group models defaults to regress.
logit causes the non-linear decomposition for a binary dependent variable to be computed using the weighting method described by Yun (2004). The estimation command for the group models defaults to logit.
probit causes the non-linear decomposition for a binary dependent variable to be computed using the weighting method described by Yun (2004). The estimation command for the group models defaults to probit.
Only one of linear, logit, or probit is allowed.
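As a rough illustration of the non-linear case, the Python sketch below computes an overall two-part split for a logit model together with Yun-style weights for the characteristics effect. It assumes Group 1 coefficients as the reference and uses made-up data; the function names are hypothetical, and oaxaca's actual computations may differ in detail.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_prediction(rows, b):
    """Average predicted probability over observations (rows of regressors)."""
    return sum(logistic(sum(bk * xk for bk, xk in zip(b, row)))
               for row in rows) / len(rows)


def overall_split(rows1, rows2, b1, b2):
    """Characteristics and coefficients effects, Group 1 coefficients as reference."""
    p11 = mean_prediction(rows1, b1)
    p21 = mean_prediction(rows2, b1)  # Group 2 characteristics, Group 1 coefficients
    p22 = mean_prediction(rows2, b2)
    return p11 - p21, p21 - p22


def characteristics_weights(rows1, rows2, b):
    """Each variable's share of the linear-index difference in means."""
    k = len(b)
    m1 = [sum(r[j] for r in rows1) / len(rows1) for j in range(k)]
    m2 = [sum(r[j] for r in rows2) / len(rows2) for j in range(k)]
    parts = [(m1[j] - m2[j]) * b[j] for j in range(k)]
    total = sum(parts)
    return [p / total for p in parts]


# Made-up data: constant, years of education, years of experience.
rows1 = [[1.0, 12.0, 5.0], [1.0, 16.0, 2.0], [1.0, 10.0, 8.0]]
rows2 = [[1.0, 9.0, 3.0], [1.0, 12.0, 1.0], [1.0, 11.0, 4.0]]
b1 = [-3.0, 0.25, 0.05]
b2 = [-3.5, 0.25, 0.04]

char_effect, coef_effect = overall_split(rows1, rows2, b1, b2)
weights = characteristics_weights(rows1, rows2, b1)
detailed = [w * char_effect for w in weights]
```

The weights sum to one, so the detailed contributions add up to the overall characteristics effect even though the model is non-linear.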
nodetail suppresses the detailed results and only computes the overall decomposition.
adjust(varlist) causes the group differential to be adjusted by the contribution of the specified variables before computing the decomposition. This is useful, for example, if the specified variables are selection terms. Note that adjust() is not needed if heckman is used to estimate the models. _offset is allowed in adjust().
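One way to read the adjustment is that the contribution of the listed variables is deducted from the raw differential before the decomposition is applied. The Python sketch below follows that reading with made-up numbers; the function name adjusted_gap and the variable lambda are hypothetical.

```python
def adjusted_gap(ybar1, ybar2, b1, b2, xbar1, xbar2, adjust):
    """Raw differential minus the contribution of the adjustment variables."""
    sel1 = sum(b1[v] * xbar1[v] for v in adjust)
    sel2 = sum(b2[v] * xbar2[v] for v in adjust)
    return (ybar1 - ybar2) - (sel1 - sel2)


# Made-up group models with a selection term called "lambda".
b1 = {"educ": 0.08, "lambda": 0.10}
b2 = {"educ": 0.07, "lambda": -0.20}
xbar1 = {"educ": 12.0, "lambda": 0.5}
xbar2 = {"educ": 11.0, "lambda": 0.8}

gap = adjusted_gap(3.40, 3.15, b1, b2, xbar1, xbar2, adjust=["lambda"])
```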
----+ Decomposition type +-----------------------------------------------
threefold[(reverse)] computes the three-fold decomposition. This is the default. The decomposition is expressed from the viewpoint of Group 2. Specify threefold(reverse) to express the decomposition from the viewpoint of Group 1.
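The two viewpoints can be illustrated with a short Python sketch that takes group means and coefficients as given. The numbers are made up and the function name threefold is used only for illustration; see Jann (2008) for the formulas that oaxaca implements.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def threefold(x1, x2, b1, b2, reverse=False):
    """Return (endowments, coefficients, interaction).

    x1, x2 are vectors of group means (including the constant) and
    b1, b2 the group coefficients. reverse=False takes the viewpoint
    of Group 2; reverse=True takes the viewpoint of Group 1.
    """
    dx = [a - c for a, c in zip(x1, x2)]
    db = [a - c for a, c in zip(b1, b2)]
    if not reverse:
        return dot(dx, b2), dot(x2, db), dot(dx, db)
    return dot(dx, b1), dot(x1, db), -dot(dx, db)


# Made-up means and coefficients: constant, educ, exper.
x1, x2 = [1.0, 12.0, 10.0], [1.0, 11.0, 8.0]
b1, b2 = [1.0, 0.08, 0.02], [0.9, 0.07, 0.015]

gap = dot(x1, b1) - dot(x2, b2)
parts = threefold(x1, x2, b1, b2)
parts_reverse = threefold(x1, x2, b1, b2, reverse=True)
```

In both variants the three components add up to the same differential; only the attribution to endowments, coefficients, and interaction changes.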
weight(# [# ...]) computes the two-fold decomposition where # [# ...] are the weights given to Group 1 relative to Group 2 in determining the reference coefficients (weights are recycled if there are more coefficients than weights). For example, weight(1) uses the Group 1 coefficients as the reference coefficients, weight(0) uses the Group 2 coefficients.
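The role of the weights can be sketched in Python as follows. The sketch assumes that recycling means cycling through the supplied weights in order; the function name twofold and the numbers are hypothetical.

```python
def twofold(x1, x2, b1, b2, weights):
    """Return (explained, unexplained) for the given reference weights.

    Each reference coefficient is w*b1 + (1-w)*b2; weights are reused
    cyclically if fewer weights than coefficients are supplied
    (an assumption about what "recycled" means).
    """
    k = len(b1)
    w = [weights[i % len(weights)] for i in range(k)]
    bref = [w[i] * b1[i] + (1 - w[i]) * b2[i] for i in range(k)]
    explained = sum((x1[i] - x2[i]) * bref[i] for i in range(k))
    unexplained = sum(x1[i] * (b1[i] - bref[i]) + x2[i] * (bref[i] - b2[i])
                      for i in range(k))
    return explained, unexplained


# Made-up means and coefficients: constant, educ, exper.
x1, x2 = [1.0, 12.0, 10.0], [1.0, 11.0, 8.0]
b1, b2 = [1.0, 0.08, 0.02], [0.9, 0.07, 0.015]

with_group1 = twofold(x1, x2, b1, b2, [1])    # like weight(1)
with_group2 = twofold(x1, x2, b1, b2, [0])    # like weight(0)
with_average = twofold(x1, x2, b1, b2, [0.5])  # like weight(0.5)
```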
pooled[(model_opts)] computes the two-fold decomposition using the coefficients from a pooled model over both groups as the reference coefficients. groupvar is included in the pooled model as an additional control variable. Estimation details may be specified in parentheses; see the model1() option below.
omega[(model_opts)] computes the two-fold decomposition using the coefficients from a pooled model over both groups as the reference coefficients (without including groupvar as a control variable). Estimation details may be specified in parentheses; see the model1() option below.
reference(name) computes the two-fold decomposition using the coefficients from a stored model. name is the name under which the model was stored; see estimates store. It is suggested not to combine reference() with vce(bootstrap) or vce(jackknife).
split causes the "unexplained" component in the two-fold decomposition to be split into a part related to Group 1 and a part related to Group 2.
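A minimal Python sketch of such a split, assuming the reference coefficients are already available: the Group 1 part measures how far the Group 1 coefficients lie from the reference, the Group 2 part how far the reference lies from the Group 2 coefficients. The function name split_unexplained and the numbers are hypothetical.

```python
def split_unexplained(x1, x2, b1, b2, bref):
    """Return (Group 1 part, Group 2 part) of the unexplained component."""
    part1 = sum(x * (b - r) for x, b, r in zip(x1, b1, bref))
    part2 = sum(x * (r - b) for x, b, r in zip(x2, b2, bref))
    return part1, part2


# Made-up means and coefficients: constant, educ, exper.
x1, x2 = [1.0, 12.0, 10.0], [1.0, 11.0, 8.0]
b1, b2 = [1.0, 0.08, 0.02], [0.9, 0.07, 0.015]

# Reference coefficients halfway between the groups (illustrative choice).
bref = [0.5 * (a + c) for a, c in zip(b1, b2)]

part1, part2 = split_unexplained(x1, x2, b1, b2, bref)
explained = sum((a - c) * r for a, c, r in zip(x1, x2, bref))
```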
Only one of threefold, weight(), pooled, omega, and reference() is allowed.
----+ X-Values +---------------------------------------------------------
x1(names_and_values) and x2(names_and_values) provide custom values for specific predictors to be used for Group 1 and Group 2 in the decomposition (only allowed with linear decomposition). The default is to use the group means of the predictors. The syntax for names_and_values is
varname [=] value [[,] varname [=] value ... ]
Example: x1(educ 12 exp 30)
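To make the names_and_values syntax concrete, the Python sketch below parses such a specification into name/value pairs, treating the equal signs and commas as optional separators. It only illustrates the syntax and is not oaxaca's parser; the function name is hypothetical.

```python
def parse_names_and_values(spec):
    """Parse 'varname [=] value [[,] varname [=] value ...]' into a dict."""
    tokens = spec.replace("=", " ").replace(",", " ").split()
    if len(tokens) % 2 != 0:
        raise ValueError("each variable name needs exactly one value")
    return {tokens[i]: float(tokens[i + 1]) for i in range(0, len(tokens), 2)}


plain = parse_names_and_values("educ 12 exp 30")
decorated = parse_names_and_values("educ = 12, exp = 30")
```

Both spellings describe the same custom values: educ is set to 12 and exp to 30, and all other predictors keep their group means.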
----+ SE/SVY +-----------------------------------------------------------
svy[([vcetype] [, svy_options])] executes oaxaca while accounting for the survey settings identified by svyset (this is essentially equivalent to applying the svy prefix command, although the svy prefix is not allowed with oaxaca due to some technical issues). vcetype and svy_options are as described in help svy.
vce(vcetype) specifies the type of standard errors reported. vcetype may be analytic (the default), robust, cluster clustvar, bootstrap, or jackknife; see [R] vce_option.
cluster(varname) adjusts standard errors for intragroup correlation; this is Stata 9 syntax for vce(cluster clustvar).
fixed[(varlist)] identifies fixed regressors (all regressors if specified without argument; experimental factors are an example of fixed regressors). The default is to treat regressors as stochastic. Stochastic regressors inflate the standard errors of the decomposition components.
suest[(name)] forces the use of suest to obtain the covariances between the models/groups. suest is implied by pooled, omega, reference(), svy, vce(cluster), and cluster(). Specify suest(name) to save suest's estimation results under name using estimates store. nosuest prevents applying suest (this may cause biased standard errors).
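Why stochastic regressors inflate the standard errors can be seen from the approximate variance of a mean prediction x'b. The Python sketch below assumes that the means and the coefficients are uncorrelated and omits the asymptotically vanishing cross term; the function names and numbers are hypothetical, and Jann (2008) gives the formulas actually used.

```python
def quad(v, M):
    """Quadratic form v' M v."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))


def var_mean_prediction(xbar, b, V_b, V_x=None):
    """Approximate variance of xbar'b.

    With fixed regressors (V_x is None) only the sampling variance of b
    matters; with stochastic regressors the sampling variance of the
    means adds the term b' V_x b.
    """
    variance = quad(xbar, V_b)
    if V_x is not None:
        variance += quad(b, V_x)
    return variance


# Made-up inputs: constant and educ.
xbar = [1.0, 12.0]
b = [1.0, 0.08]
V_b = [[0.01, -0.0005], [-0.0005, 0.0001]]
V_x = [[0.0, 0.0], [0.0, 0.04]]  # the constant has no sampling variance

var_fixed = var_mean_prediction(xbar, b, V_b)
var_stochastic = var_mean_prediction(xbar, b, V_b, V_x)
```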
nose suppresses the computation of standard errors.
----+ Model estimation +-------------------------------------------------
model1(model_opts) and model2(model_opts) specify the estimation details for the two group models. The syntax for model_opts is
[estcom] [, store(name) addrhs(spec) estcom_options ]
where estcom is the estimation command to be used and estcom_options are options allowed by estcom. store(name) saves the model's estimation results under name using estimates store. addrhs(spec) adds spec to the "right-hand side" of the model. For example, use addrhs() to add extra variables to the model. Examples:
model1(heckman, select(varlist_s) twostep)
model1(ivregress 2sls, addrhs((varlist2=varlist_iv)))
Note that oaxaca uses the first equation if a model contains multiple equations. Furthermore, coefficients that only occur in one of the models are assumed zero in the other model. It is required, however, that the associated variables contain non-missing values for all observations in both groups.
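The zero-filling rule for coefficients that appear in only one model can be sketched in Python as follows; the function name align and the coefficient values are hypothetical.

```python
def align(b1, b2):
    """Put two coefficient dicts on a common list of names, filling gaps with 0."""
    names = list(b1) + [name for name in b2 if name not in b1]
    return (names,
            [b1.get(name, 0.0) for name in names],
            [b2.get(name, 0.0) for name in names])


# Made-up models: tenure appears only in the Group 1 model.
names, v1, v2 = align({"educ": 0.08, "tenure": 0.01, "_cons": 1.0},
                      {"educ": 0.07, "_cons": 0.9})
```

Because the zero coefficient is still multiplied by the group mean of tenure in the decomposition, the variable must be observed in both groups.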
noisily displays the models' estimation output.
relax causes oaxaca to continue its computations even if coefficients are dropped from the models (e.g. due to collinearity) or if some coefficients have zero variances. The default is to exit with an error in such a situation.
estopts are common options to be passed through to the models.
----+ Reporting +--------------------------------------------------------
xb displays a table containing the regression coefficients and predictor values on which the decomposition is based.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.
eform specifies that the results be displayed in exponentiated form.
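For a log outcome such as log wages, exponentiating turns the additive components into multiplicative factors. The Python sketch below uses made-up component values to show the relationship; it does not reproduce oaxaca's output.

```python
import math

# Made-up two-fold results on the log scale.
components = {"difference": 0.25, "explained": 0.10, "unexplained": 0.15}

# Exponentiated form: each entry becomes a ratio rather than a log difference.
exponentiated = {name: math.exp(value) for name, value in components.items()}
```

On the exponentiated scale the explained and unexplained factors multiply, rather than add, to the overall ratio.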
nolegend suppresses the legend about the sets of independent variables.
Examples
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta
. oaxaca lnwage educ exper tenure, by(female)
. oaxaca lnwage educ exper tenure, by(female) weight(1)
. oaxaca lnwage educ exper tenure, by(female) pooled
. svyset [pw=wt]
. oaxaca lnwage educ exper tenure, by(female) pooled svy
. oaxaca lnwage educ exper tenure, by(female) pooled vce(bootstrap)
. tabulate isco, nofreq generate(isco)
. oaxaca lnwage educ exper tenure normalize(isco?), by(female) pooled
. use http://fmwww.bc.edu/RePEc/bocode/h/homecomp.dta, clear
. oaxaca homecomp female age (educ:hsgrad somecol college) (marstat:married prevmar) if white==1|black==1, by(black) logit pooled
Saved Results
Scalars
  e(N)            number of observations
  e(N_1)          number of observations in Group 1
  e(N_2)          number of observations in Group 2
  e(N_clust)      number of clusters

Macros
  e(cmd)          oaxaca
  e(title)        Blinder-Oaxaca decomposition
  e(by)           name of group variable
  e(group_1)      value of group variable for Group 1
  e(group_2)      value of group variable for Group 2
  e(depvar)       name of dependent variable
  e(model)        linear, logit, or probit
  e(threefold)    threefold, threefold(reverse), or empty
  e(weights)      weights specified by weight() or empty
  e(refcoefs)     pooled, omega, name of reference model, or empty
  e(legend)       definitions of regressor sets
  e(normalized)   normalized indicator sets
  e(adjust)       names of adjustment variables
  e(fixed)        fixed, fixed(varlist), or empty
  e(suest)        suest or empty
  e(wtype)        weight type
  e(wexp)         weight expression
  e(clustvar)     name of cluster variable
  e(vce)          vcetype specified in vce()
  e(vcetype)      title used to label Std. Err.
  e(properties)   b V

Matrices
  e(b)            decomposition results
  e(V)            variance-covariance matrix of decomposition results
  e(b0)           vector containing coefficients and X-values
  e(V0)           variance-covariance matrix of e(b0)

Functions
  e(sample)       marks estimation sample
References
Fairlie, Robert W. (2005). An extension of the Blinder-Oaxaca decomposition technique to logit and probit models. Journal of Economic and Social Measurement 30: 305-316.
Jann, Ben (2008). The Blinder-Oaxaca decomposition for linear regression models. The Stata Journal 8(4): 453-479. [Working paper version available from: http://ideas.repec.org/p/ets/wpaper/5.html]
Yun, Myeong-Su (2004). Decomposing differences in the first moment. Economics Letters 82: 275-280.
Yun, Myeong-Su (2005). A Simple Solution to the Identification Problem in Detailed Wage Decompositions. Economic Inquiry 43: 766-772.
Author
Ben Jann, Institute of Sociology, University of Bern, jann@soz.unibe.ch
Also see
Online: help for regress, logit, probit, heckman, suest, svyset; fairlie