```
-------------------------------------------------------------------------------
help for decomp
-------------------------------------------------------------------------------

Decomposition of wage gaps

Syntax involves a sequence of steps:

regress varlist [weight] if exp (where exp is group==high wage, for
example, race==1)

himod [weight] [,ds]

regress varlist [weight] if exp (where exp is group==low wage, for
example, race==2)

lomod [weight] [,ds]

decomp [,r]

aweights and fweights are allowed; see weights.

Description

decomp computes Blinder-Oaxaca wage decompositions. It compares the
results from two regressions, using intermediate commands (himod and
lomod), and produces a table of output containing the decompositions.
These decompositions show how much of the wage gap is due to differing
endowments between the two groups, and how much is due to discrimination
(regarded as the portion of the wage gap due to the combined effect of
coefficients and slope intercepts for the two groups).

decomp is designed for Stata's regress command, but also works with other
regression commands, such as ivreg and tobit.  The previous version
required a heck option if decomp was used with Stata's heckman command.
This is no longer necessary. decomp now recognises if the regression is a
heckman type and takes account of this. This is also the case with tobit
regression, which decomp also automatically recognises. This means that
the only option which may be specified with himod or lomod is ds. Existing
user syntax containing the heck option should be edited to remove this
term.

See oaxaca by Ben Jann for a package which is far more comprehensive and
up-to-date than decomp.

Options

Option for himod and lomod is ds (details).This provides a table of
coefficients, means and predictions for each of the regressions. These are
the data used by decomp to conduct the decomposition.

Options for decomp are r (reverse), which computes the decomposition with
the low-wage group as the reference point. See below for more details.

To make use of weighting, weights (either aweights or fweights) must be
applied in the regression commands, and then repeated in the himod and
lomod routines. No weights should be specified when decomp itself is run.

Method

In essence, the Blinder-Oaxaca decomposition breaks down the wage gap
between high-wage and low-wage workers into several components. The
unexplained component is the difference in the shift coefficients (or
constants) between the two wage equations. Being inexplicable, this
component can be attributed to discrimination. However, Blinder also
argued that the explained component of the wage gap also contains a
portion that is due to discrimination. To examine this Blinder decomposed
the explained component into:

1. the differences in endowments between the two groups,
"as evaluated by the high-wage group's wage equation" ;
and

2. "the difference between how the high-wage equation would
value the characteristics of the low-wage group, and how
the low-wage equation actually values them".

Blinder called the first part the amount "attributable to the endowments"
and the second part the amount "attributable to the coefficients", and he
argued that the second part should also be viewed as reflecting
discrimination:

"[this] only exists because the market evaluates
differently the identical bundle of traits if possessed by
members of different demographic groups, [and] is a
reflection of discrimination as much as the shift
coefficient is."

decomp closely follows Blinder's exposition and uses both his method and
his terminology.  decomp takes the average endowment differences between
the two groups and weights them (multiplies them) by the high-wage
workers'estimated coefficients.  The differences in the estimated
coefficients are weighted (multiplied by) the average characteristics of
the low-wage workers.

Conventionally, the high-wage group's wage structure is regarded as the
"non-discriminatory norm", that is, the reference group. With the reverse
option (r) switched on, the low-wage group becomes the reference group.
The average endowment differences are now weighted by the low-wage
workers' estimated coefficients, and the coefficient differences are
weighted by the mean characteristics of the high-wage workers.

The results from decomp are presented using Blinder's (1973) original
formulation of E, C, U and D.

The endowments (E) component of the decomposition is the sum of (the
coefficient vector of the regressors of the high-wage group) times (the
difference in group means between the high-wage and low-wage groups for
the vector of regressors).

The coefficients (C) component of the decomposition is the sum of the
(group means of the low-wage group for the vector of regressors) times
(the difference between the regression coefficients of the high-wage group
and the low-wage group).

The unexplained portion of the differential (U) is the difference in
constants between the high-wage wage and the low-wage group.

The portion of the differential due to discrimination is C + U.

The raw (or total) differential is E + C + U.

Examples

-------------------------------------------------------------------------------

Using regress in a wage equation where high wage and low wage is based on race:

. use http://www.stata-press.com/data/r8/nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. keep if year==88
(26262 observations deleted)

. reg ln_wage age tenure collgrad if race==1

Source |       SS       df       MS              Number of obs =    1636
-------------+------------------------------           F(  3,  1632) =   90.03
Model |  81.4751215     3  27.1583738           Prob > F      =  0.0000
Residual |  492.287598  1632  .301646812           R-squared     =  0.1420
Total |  573.762719  1635  .350925211           Root MSE      =  .54922

------------------------------------------------------------------------------
ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |  -.0071553   .0044197    -1.62   0.106    -.0158241    .0015136
tenure |   .0292267   .0024998    11.69   0.000     .0243235    .0341298
collgrad |   .3271635   .0311724    10.50   0.000     .2660213    .3883057
_cons |   1.953557   .1737702    11.24   0.000     1.612721    2.294393
------------------------------------------------------------------------------

himod, ds

Coefficients, means & predictions for high model

------------------------------------------------------
Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
age |        -0.007      39.263        -0.281
tenure |         0.029       5.802         0.170
_cons |         1.954       1.000         1.954
------------------------------------------------------

Prediction (ln):     1.926
Prediction (\$):      6.86

. reg ln_wage age tenure collgrad if race==2

Source |       SS       df       MS              Number of obs =     580
-------------+------------------------------           F(  3,   576) =   59.86
Model |  45.9587803     3  15.3195934           Prob > F      =  0.0000
Residual |  147.408098   576  .255916836           R-squared     =  0.2377
Total |  193.366878   579  .333966974           Root MSE      =  .50588

------------------------------------------------------------------------------
ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |  -.0091953    .007085    -1.30   0.195    -.0231109    .0047204
tenure |   .0267151   .0037902     7.05   0.000     .0192708    .0341593
collgrad |   .5721103   .0558089    10.25   0.000     .4624966     .681724
_cons |   1.842348   .2754947     6.69   0.000     1.301252    2.383445
------------------------------------------------------------------------------

. lomod, ds

Coefficients, means & predictions for low model

------------------------------------------------------
Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
age |        -0.009      38.828        -0.357
tenure |         0.027       6.490         0.173
_cons |         1.842       1.000         1.842
------------------------------------------------------

Prediction (ln):     1.759
Prediction (\$):      5.81

. decomp

Decomposition results for variables (as %s)

------------------------------------------------------
Variable |        Attrib       Endow         Coeff
-------------+----------------------------------------
age |           7.6        -0.3           7.9
tenure |          -0.4        -2.0           1.6
-------------+----------------------------------------
Subtotal |           5.6         0.3           5.2
------------------------------------------------------

Summary of decomposition results (as %)

-------------------------------------------
Amount attributable:             |      5.6
- due to endowments (E):         |      0.3
- due to coefficients (C):       |      5.2
Shift coefficient (U):           |     11.1
Raw differential (R) {E+C+U}:    |     16.7
Adjusted differential (D) {C+U}: |     16.4
---------------------------------+---------
Endowments as % total (E/R):     |      2.0
Discrimination as % total (D/R): |     98.0
-------------------------------------------

U = unexplained portion of differential
(difference between model constants)
D = portion due to discrimination (C+U)

positive number indicates advantage to high group
negative number indicates advantage to low group

Interpreting the results:

By comparing the output from the two regression equations is is clear
that white workers have higher constants and this is reflected in the
11.1% advantage in U (the shift coefficient). White workers also have
higher returns to age and tenure, but not to college graduation.
Nevertheless, the size of the age coefficient is such as to offset
this last factor, leaving white workers with a net advantage in C of
5.2%.  There is little difference in endowments between the two
groups, something evident from a comparison of the himod and lomod
output, which shows that there is little difference (apart from
college graduation) between the average group characteristics of
white and black workers. This lack of group differences is reflected
in the small figure for E, just 0.3%.

Consequently, there is little difference between the raw differential
(16.7%) and the adjusted differential (16.4%) because the difference
in endowments between white and black workers is so small. In other
words, almost all of the difference (98%) is due to discrimination,
and this is made up of the difference in the shift coefficient (U)
and differences in how the endowments are rewarded (C).

-------------------------------------------------------------------------------

Using heckman in a wage equation where high wage and low wage is based on
county.  Note the absence of the earlier heck option.

. use http://www.stata-press.com/data/r8/womenwk
(657 missing values generated)

. heckman lnwage educ age, select(married children educ age), if county==9
note: married dropped due to collinearity

Iteration 0:   log likelihood = -74.063916
Iteration 1:   log likelihood = -74.036062
Iteration 2:   log likelihood = -74.036026
Iteration 3:   log likelihood = -74.036026

Heckman selection model                         Number of obs      =       200
(regression model with sample selection)        Censored obs       =        36
Uncensored obs     =       164

Wald chi2(2)       =     28.14
Log likelihood = -74.03603                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
education |   .0351091   .0074637     4.70   0.000     .0204805    .0497376
age |   .0115728   .0039782     2.91   0.004     .0037757      .01937
_cons |   2.159828   .2213499     9.76   0.000      1.72599    2.593666
-------------+----------------------------------------------------------------
select       |
children |   .5907552    .119561     4.94   0.000       .35642    .8250904
education |   .0475423   .0426328     1.12   0.265    -.0360165    .1311011
age |   .0842936   .0297379     2.83   0.005     .0260084    .1425788
_cons |  -4.228175   1.466693    -2.88   0.004    -7.102841    -1.35351
-------------+----------------------------------------------------------------
/athrho |   .3280496   .2852638     1.15   0.250    -.2310572    .8871564
/lnsigma |  -1.383954   .0590332   -23.44   0.000    -1.499657   -1.268251
-------------+----------------------------------------------------------------
rho |   .3167672   .2566401                     -.2270313    .7099864
sigma |   .2505858   .0147929                      .2232067    .2813233
lambda |   .0793774   .0661307                     -.0502364    .2089911
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.03   Prob > chi2 = 0.3097
------------------------------------------------------------------------------

. himod, ds

Coefficients, means & predictions for high model

------------------------------------------------------
Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
education |         0.035      14.820         0.520
age |         0.012      43.620         0.505
_cons |         2.160       1.000         2.160
------------------------------------------------------

Prediction (ln):     3.185
Prediction (\$):     24.17

. heckman lnwage educ age, select(married children educ age), if county==1

Iteration 0:   log likelihood = -105.65156
Iteration 1:   log likelihood = -105.44248
Iteration 2:   log likelihood =  -105.4423
Iteration 3:   log likelihood =  -105.4423

Heckman selection model                         Number of obs      =       200
(regression model with sample selection)        Censored obs       =        87
Uncensored obs     =       113

Wald chi2(2)       =     27.98
Log likelihood = -105.4423                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
education |   .0404733   .0085642     4.73   0.000     .0236878    .0572588
age |   .0077226   .0026888     2.87   0.004     .0024527    .0129925
_cons |   2.231897   .1482204    15.06   0.000      1.94139    2.522403
-------------+----------------------------------------------------------------
select       |
married |   .9627806   .2389799     4.03   0.000     .4943886    1.431173
children |   .6902933   .0953078     7.24   0.000     .5034935    .8770932
education |   .0983743   .0361862     2.72   0.007     .0274507     .169298
age |   .0320238   .0118514     2.70   0.007     .0087954    .0552522
_cons |  -3.221248   .6438905    -5.00   0.000     -4.48325   -1.959246
-------------+----------------------------------------------------------------
/athrho |   .6845914   .2330463     2.94   0.003      .227829    1.141354
/lnsigma |  -1.303502   .0810706   -16.08   0.000    -1.462398   -1.144607
-------------+----------------------------------------------------------------
rho |   .5944962   .1506818                      .2239672    .8148694
sigma |    .271579   .0220171                      .2316801    .3183491
lambda |   .1614527   .0497236                      .0639962    .2589092
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     7.33   Prob > chi2 = 0.0068
------------------------------------------------------------------------------

. lomod, ds

Coefficients, means & predictions for low model

------------------------------------------------------
Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
education |         0.040      11.480         0.465
age |         0.008      30.865         0.238
_cons |         2.232       1.000         2.232
------------------------------------------------------

Prediction (ln):     2.935
Prediction (\$):     18.82

. decomp

Decomposition results for variables (as %s)

------------------------------------------------------
Variable |        Attrib       Endow         Coeff
-------------+----------------------------------------
education |           5.6        11.7          -6.2
age |          26.6        14.8          11.9
-------------+----------------------------------------
Subtotal |          32.2        26.5           5.7
------------------------------------------------------

Summary of decomposition results (as %)

-------------------------------------------
Amount attributable:             |     32.2
- due to endowments (E):         |     26.5
- due to coefficients (C):       |      5.7
Shift coefficient (U):           |     -7.2
Raw differential (R) {E+C+U}:    |     25.0
Adjusted differential (D) {C+U}: |     -1.5
---------------------------------+---------
Endowments as % total (E/R):     |    105.9
Discrimination as % total (D/R): |     -5.9
-------------------------------------------

U = unexplained portion of differential
(difference between model constants)
D = portion due to discrimination (C+U)

positive number indicates advantage to high group
negative number indicates advantage to low group

References

Alan S. Blinder (1973) 'Wage Discrimination: Reduced Form and Structural
Estimates', Journal of Human Resources, 18:4, Fall, 436-455.

Ronald Oaxaca (1973) 'Male-Female Wage Differentials in Urban Labor
Markets', International Economic Review, 14:3, October, 693-709.

Note on versions

Version 1.7 of decomp has been written for Stata Release 8.2. It differs
from Version 1.6 in only one respect. If fixes a bug whereby when using
selection models, decomp was using the full sample, rather than the wage
sample (ie. outcome sample). This has now been corrected. Two temporary
variables are used for this: __fullsample and __wagesample. These are
unlikely to already exist in the user's dataset and they are removed when
himod and lomod conclude.  Thanks to Anne Busch for drawing this to my
attention.

Author

Ian Watson
Freelance researcher and
Visiting Senior Research Fellow
Macquarie University
Sydney Australia
mail@ianwatson.com.au
www.ianwatson.com.au

```