{smcl}
{* 8nov2010}
{hline}
help for {hi:decomp}
{hline}

{title:Decomposition of wage gaps}

{p} Syntax involves a sequence of steps:

{p 8 14}{cmd:regress} {it:varlist} [{it:weight}] {cmd:if} {it:exp} (where
{it:exp} is group==high wage, for example, race==1)

{p 8} {cmd:himod} [{it:weight}] [,{cmd:ds}]

{p 8 14}{cmd:regress} {it:varlist} [{it:weight}] {cmd:if} {it:exp} (where
{it:exp} is group==low wage, for example, race==2)

{p 8} {cmd:lomod} [{it:weight}] [,{cmd:ds}]

{p 8} {cmd:decomp} [,{cmd:r}]

{p 8 14} {cmd: aweight}s and {cmd:fweight}s are allowed; see {help weights}.


{title:Description}

{p 5 5}{cmd:decomp} computes Blinder-Oaxaca wage decompositions. It compares the
results from two regressions, using intermediate commands ({cmd:himod} and
{cmd:lomod}), and produces a  table of output containing the decompositions.
These decompositions show how much of  the wage gap is due to differing
endowments between the two groups, and how much is due to discrimination
(regarded as the portion of the wage gap due to the combined effect of 
coefficients and slope intercepts for the two groups).

{p 5 5}{cmd:decomp} is designed for Stata's {help regress} command, but also
works with other regression commands, such as {help ivreg} and {help tobit}. 
The previous version required a {cmd:heck} option if {cmd:decomp} was
used with Stata's {help heckman} command. This is no longer necessary. {cmd:decomp} now
recognises if the regression is a heckman type and takes account of this. This
is also the case with tobit regression, which {cmd:decomp} also automatically
recognises. This means that the only option which may be specified with 
{cmd:himod} or {cmd:lomod} is {cmd:ds}. Existing user syntax containing 
the {cmd:heck} option should be edited to remove this term.


{p 5 5} See {net "describe http://fmwww.bc.edu/RePEc/bocode/o/oaxaca":oaxaca} by 
Ben Jann for a package which is far more comprehensive and up-to-date than {cmd: decomp}.

{title:Options}

{p 5 5}Option for {cmd:himod} and {cmd:lomod} is {cmd:ds} (details).This provides 
a table of coefficients, means and predictions for each
of the regressions. These are the data used by {cmd:decomp} to conduct the
decomposition. 


{p 5 5} Options for {cmd:decomp} are {cmd:r} (reverse), which computes the
decomposition with the low-wage group as the reference point. See below for more
details.

{p 5 5} To make use of weighting, weights (either {cmd:aweight}s or
{cmd:fweight}s) must be applied in the regression commands, and then repeated in
the {cmd:himod} and {cmd:lomod} routines. No weights should be specified when
{cmd:decomp} itself is run.


{title:Method}

{p 5 5}In essence, the Blinder-Oaxaca decomposition breaks down the wage gap
between high-wage and low-wage workers into several components. The unexplained
component is the difference in the shift coefficients (or constants) between the
two wage equations. Being inexplicable, this component can be attributed to 
discrimination. However, Blinder also argued that the explained component of the
wage gap also contains a portion that is due to discrimination. To examine this Blinder 
decomposed the explained component into:

{p 10 13 10}1. the differences in endowments between the two groups, {it:"as evaluated}
{it:by the high-wage group's wage equation"} ; and

{p 10 13 10}2. "the difference between how the high-wage equation {it:would value} the
characteristics of the low-wage group, and how the low-wage equation {it:actually values} them".

{p 5 5}Blinder called the first part the amount "attributable to the endowments" and the second
part the amount "attributable to the coefficients", and he argued that the second part should
also be viewed as reflecting discrimination:

{p 10 10 10}"[this] only exists because the market evaluates differently the identical
bundle of traits if possessed by members of different demographic groups, [and]
is a reflection of discrimination as much as the shift coefficient is."

{p 5 5}{cmd:decomp} closely follows Blinder's exposition and uses both his method and
his terminology. {cmd: decomp} takes the average endowment differences between the two 
groups and weights them (multiplies them) by the high-wage workers'estimated coefficients. 
The differences in the estimated coefficients are weighted (multiplied by) the average 
characteristics of the low-wage workers.

{p 5 5}Conventionally, the high-wage group's wage structure is regarded as the 
"non-discriminatory norm", that is, the reference group. With the reverse option ({cmd:r})
switched on, the low-wage group becomes the reference group. The average endowment 
differences are now weighted by the low-wage workers' estimated coefficients, 
and the coefficient differences are weighted by the mean characteristics of the 
high-wage workers.

{p 5 5} The results from {cmd: decomp} are presented using Blinder's (1973) original 
formulation of E, C, U and D.

{p 5 5} The endowments (E) component of the decomposition is the sum of (the
coefficient vector of the regressors of the high-wage group) times (the
difference in group means between the high-wage and low-wage groups for the
vector of regressors).

{p 5 5} The coefficients (C) component of the decomposition is the sum of the
(group means of the low-wage group for the vector of regressors) times (the
difference between the regression coefficients of the high-wage group and the
low-wage group).

{p 5 5} The unexplained portion of the differential (U) is the difference in
constants between the high-wage wage and the low-wage group.

{p 5 5} The portion of the differential due to discrimination is C + U.

{p 5 5} The raw (or total) differential is E + C + U.


{title:Examples}

{hline}

{p} Using {help regress} in a wage equation where high wage and low wage is based on race:

{cmd:. use http://www.stata-press.com/data/r8/nlswork}
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

{cmd:. keep if year==88}
(26262 observations deleted)

{cmd:. reg ln_wage age tenure collgrad if race==1}

      Source |       SS       df       MS              Number of obs =    1636
-------------+------------------------------           F(  3,  1632) =   90.03
       Model |  81.4751215     3  27.1583738           Prob > F      =  0.0000
    Residual |  492.287598  1632  .301646812           R-squared     =  0.1420
-------------+------------------------------           Adj R-squared =  0.1404
       Total |  573.762719  1635  .350925211           Root MSE      =  .54922

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0071553   .0044197    -1.62   0.106    -.0158241    .0015136
      tenure |   .0292267   .0024998    11.69   0.000     .0243235    .0341298
    collgrad |   .3271635   .0311724    10.50   0.000     .2660213    .3883057
       _cons |   1.953557   .1737702    11.24   0.000     1.612721    2.294393
------------------------------------------------------------------------------

{cmd: himod, ds}

Coefficients, means & predictions for high model

------------------------------------------------------
    Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
         age |        -0.007      39.263        -0.281
      tenure |         0.029       5.802         0.170
    collgrad |         0.327       0.257         0.084
       _cons |         1.954       1.000         1.954
------------------------------------------------------

Prediction (ln):     1.926
Prediction ($):      6.86

{cmd:. reg ln_wage age tenure collgrad if race==2}

      Source |       SS       df       MS              Number of obs =     580
-------------+------------------------------           F(  3,   576) =   59.86
       Model |  45.9587803     3  15.3195934           Prob > F      =  0.0000
    Residual |  147.408098   576  .255916836           R-squared     =  0.2377
-------------+------------------------------           Adj R-squared =  0.2337
       Total |  193.366878   579  .333966974           Root MSE      =  .50588

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0091953    .007085    -1.30   0.195    -.0231109    .0047204
      tenure |   .0267151   .0037902     7.05   0.000     .0192708    .0341593
    collgrad |   .5721103   .0558089    10.25   0.000     .4624966     .681724
       _cons |   1.842348   .2754947     6.69   0.000     1.301252    2.383445
------------------------------------------------------------------------------

{cmd:. lomod, ds}

Coefficients, means & predictions for low model

------------------------------------------------------
    Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
         age |        -0.009      38.828        -0.357
      tenure |         0.027       6.490         0.173
    collgrad |         0.572       0.176         0.101
       _cons |         1.842       1.000         1.842
------------------------------------------------------

Prediction (ln):     1.759
Prediction ($):      5.81

{cmd:. decomp}

Decomposition results for variables (as %s)

------------------------------------------------------
    Variable |        Attrib       Endow         Coeff
-------------+----------------------------------------
         age |           7.6        -0.3           7.9
      tenure |          -0.4        -2.0           1.6
    collgrad |          -1.6         2.7          -4.3
-------------+----------------------------------------
    Subtotal |           5.6         0.3           5.2
------------------------------------------------------



Summary of decomposition results (as %)

-------------------------------------------
Amount attributable:             |      5.6
- due to endowments (E):         |      0.3
- due to coefficients (C):       |      5.2
Shift coefficient (U):           |     11.1
Raw differential (R) {E+C+U}:    |     16.7
Adjusted differential (D) {C+U}: |     16.4
---------------------------------+---------
Endowments as % total (E/R):     |      2.0
Discrimination as % total (D/R): |     98.0
-------------------------------------------

  U = unexplained portion of differential
      (difference between model constants)
  D = portion due to discrimination (C+U)

  positive number indicates advantage to high group
  negative number indicates advantage to low group

{p 5 5 5}{it:Interpreting the results:} 

{p 5 5 5}By comparing the output from the two
regression equations is is clear that white workers have higher constants and
this is reflected in the 11.1% advantage in U (the shift coefficient). White
workers also have higher returns to age and tenure, but not to college
graduation. Nevertheless, the size of the age coefficient is such as to offset
this last factor, leaving white workers with a net advantage in C of 5.2%.
There is little difference in endowments between the two groups, something evident
from a comparison of the {cmd:himod} and {cmd:lomod} output, which shows that there is
little difference (apart from college graduation) between the average group
characteristics of white and black workers. This lack of group differences is
reflected in the small figure for E, just 0.3%.

{p 5 5 5}Consequently, there is little difference between the
raw differential (16.7%) and the adjusted differential (16.4%) because the
difference in endowments between white and black workers is so small. In other
words, almost all of the difference (98%) is due to discrimination, and this is
made up of the difference in the shift coefficient (U) and differences in how the
endowments are rewarded (C).

{hline}

{p} Using {help heckman} in a wage equation where high wage and low wage is based on county. 
Note the absence of the earlier {cmd:heck} option.

{cmd:. use http://www.stata-press.com/data/r8/womenwk}
(657 missing values generated)

{cmd:. heckman lnwage educ age, select(married children educ age), if county==9}
note: married dropped due to collinearity

Iteration 0:   log likelihood = -74.063916  
Iteration 1:   log likelihood = -74.036062  
Iteration 2:   log likelihood = -74.036026  
Iteration 3:   log likelihood = -74.036026  

Heckman selection model                         Number of obs      =       200
(regression model with sample selection)        Censored obs       =        36
                                                Uncensored obs     =       164

                                                Wald chi2(2)       =     28.14
Log likelihood = -74.03603                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
   education |   .0351091   .0074637     4.70   0.000     .0204805    .0497376
         age |   .0115728   .0039782     2.91   0.004     .0037757      .01937
       _cons |   2.159828   .2213499     9.76   0.000      1.72599    2.593666
-------------+----------------------------------------------------------------
select       |
    children |   .5907552    .119561     4.94   0.000       .35642    .8250904
   education |   .0475423   .0426328     1.12   0.265    -.0360165    .1311011
         age |   .0842936   .0297379     2.83   0.005     .0260084    .1425788
       _cons |  -4.228175   1.466693    -2.88   0.004    -7.102841    -1.35351
-------------+----------------------------------------------------------------
     /athrho |   .3280496   .2852638     1.15   0.250    -.2310572    .8871564
    /lnsigma |  -1.383954   .0590332   -23.44   0.000    -1.499657   -1.268251
-------------+----------------------------------------------------------------
         rho |   .3167672   .2566401                     -.2270313    .7099864
       sigma |   .2505858   .0147929                      .2232067    .2813233
      lambda |   .0793774   .0661307                     -.0502364    .2089911
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.03   Prob > chi2 = 0.3097
------------------------------------------------------------------------------

{cmd:. himod, ds}

Coefficients, means & predictions for high model

------------------------------------------------------
    Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
   education |         0.035      14.820         0.520
         age |         0.012      43.620         0.505
       _cons |         2.160       1.000         2.160
------------------------------------------------------

Prediction (ln):     3.185
Prediction ($):     24.17

{cmd:. heckman lnwage educ age, select(married children educ age), if county==1}

Iteration 0:   log likelihood = -105.65156  
Iteration 1:   log likelihood = -105.44248  
Iteration 2:   log likelihood =  -105.4423  
Iteration 3:   log likelihood =  -105.4423  

Heckman selection model                         Number of obs      =       200
(regression model with sample selection)        Censored obs       =        87
                                                Uncensored obs     =       113

                                                Wald chi2(2)       =     27.98
Log likelihood = -105.4423                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
   education |   .0404733   .0085642     4.73   0.000     .0236878    .0572588
         age |   .0077226   .0026888     2.87   0.004     .0024527    .0129925
       _cons |   2.231897   .1482204    15.06   0.000      1.94139    2.522403
-------------+----------------------------------------------------------------
select       |
     married |   .9627806   .2389799     4.03   0.000     .4943886    1.431173
    children |   .6902933   .0953078     7.24   0.000     .5034935    .8770932
   education |   .0983743   .0361862     2.72   0.007     .0274507     .169298
         age |   .0320238   .0118514     2.70   0.007     .0087954    .0552522
       _cons |  -3.221248   .6438905    -5.00   0.000     -4.48325   -1.959246
-------------+----------------------------------------------------------------
     /athrho |   .6845914   .2330463     2.94   0.003      .227829    1.141354
    /lnsigma |  -1.303502   .0810706   -16.08   0.000    -1.462398   -1.144607
-------------+----------------------------------------------------------------
         rho |   .5944962   .1506818                      .2239672    .8148694
       sigma |    .271579   .0220171                      .2316801    .3183491
      lambda |   .1614527   .0497236                      .0639962    .2589092
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     7.33   Prob > chi2 = 0.0068
------------------------------------------------------------------------------

{cmd:. lomod, ds}

Coefficients, means & predictions for low model

------------------------------------------------------
    Variable |    Coefficent        Mean    Prediction
-------------+----------------------------------------
   education |         0.040      11.480         0.465
         age |         0.008      30.865         0.238
       _cons |         2.232       1.000         2.232
------------------------------------------------------

Prediction (ln):     2.935
Prediction ($):     18.82

{cmd:. decomp}

Decomposition results for variables (as %s)

------------------------------------------------------
    Variable |        Attrib       Endow         Coeff
-------------+----------------------------------------
   education |           5.6        11.7          -6.2
         age |          26.6        14.8          11.9
-------------+----------------------------------------
    Subtotal |          32.2        26.5           5.7
------------------------------------------------------



Summary of decomposition results (as %)

-------------------------------------------
Amount attributable:             |     32.2
- due to endowments (E):         |     26.5
- due to coefficients (C):       |      5.7
Shift coefficient (U):           |     -7.2
Raw differential (R) {E+C+U}:    |     25.0
Adjusted differential (D) {C+U}: |     -1.5
---------------------------------+---------
Endowments as % total (E/R):     |    105.9
Discrimination as % total (D/R): |     -5.9
-------------------------------------------

  U = unexplained portion of differential
      (difference between model constants)
  D = portion due to discrimination (C+U)

  positive number indicates advantage to high group
  negative number indicates advantage to low group



{title:References}

{p 5 5} Alan S. Blinder (1973) 'Wage Discrimination: Reduced Form and
Structural Estimates', Journal of Human Resources, 18:4, Fall,
436-455.

{p 5 5} Ronald Oaxaca (1973) 'Male-Female Wage Differentials in Urban
Labor Markets', International Economic Review, 14:3, October,
693-709.

{title:Note on versions}

{p 5 5} Version 1.7 of {cmd:decomp} has been written for Stata Release 8.2. It 
differs from Version 1.6 in only one respect. If fixes a bug whereby when using
selection models, {cmd:decomp} was using the full sample, rather than the wage 
sample (ie. outcome sample). This has now been corrected. Two temporary variables
are used for this: __fullsample and __wagesample. These are unlikely to already 
exist in the user's dataset and they are removed when himod and lomod conclude.
Thanks to Anne Busch for drawing this to my attention.



{title:Author}

   Ian Watson
   Freelance researcher and
   Visiting Senior Research Fellow
   Macquarie University
   Sydney Australia
   mail@ianwatson.com.au
   www.ianwatson.com.au