------------------------------------------------------------------------------- help for decomp -------------------------------------------------------------------------------
Decomposition of wage gaps
Syntax involves a sequence of steps:
regress varlist [weight] if exp (where exp is group==high wage, for example, race==1)
himod [weight] [,ds]
regress varlist [weight] if exp (where exp is group==low wage, for example, race==2)
lomod [weight] [,ds]
decomp [,r]
aweights and fweights are allowed; see weights.
Description
decomp computes Blinder-Oaxaca wage decompositions. It compares the results from two regressions, using intermediate commands (himod and lomod), and produces a table of output containing the decompositions. These decompositions show how much of the wage gap is due to differing endowments between the two groups, and how much is due to discrimination (regarded as the portion of the wage gap due to the combined effect of the differences in coefficients and in intercepts between the two groups).
decomp is designed for Stata's regress command, but also works with other regression commands, such as ivreg and tobit. The previous version required a heck option if decomp was used with Stata's heckman command. This is no longer necessary. decomp now recognises if the regression is a heckman type and takes account of this. This is also the case with tobit regression, which decomp also automatically recognises. This means that the only option which may be specified with himod or lomod is ds. Existing user syntax containing the heck option should be edited to remove this term.
See oaxaca by Ben Jann for a package which is far more comprehensive and up-to-date than decomp.
Options
The option for himod and lomod is ds (details). This provides a table of coefficients, means and predictions for each of the regressions. These are the data used by decomp to conduct the decomposition.
The option for decomp is r (reverse), which computes the decomposition with the low-wage group as the reference point. See below for more details.
To make use of weighting, weights (either aweights or fweights) must be applied in the regression commands, and then repeated in the himod and lomod routines. No weights should be specified when decomp itself is run.
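As a minimal sketch of this sequence, assume the dataset contains an analytic weight variable called wt. That name is hypothetical and is not part of the nlswork data used in the examples below; the other variable names are taken from those examples. The weight is given to each regress command and repeated for himod and lomod, while decomp is run without one:

```stata
* wt is a hypothetical weight variable, used for illustration only
regress ln_wage age tenure collgrad [aweight=wt] if race==1
himod [aweight=wt]

regress ln_wage age tenure collgrad [aweight=wt] if race==2
lomod [aweight=wt]

* no weight is specified here
decomp
```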
Method
In essence, the Blinder-Oaxaca decomposition breaks down the wage gap between high-wage and low-wage workers into several components. The unexplained component is the difference in the shift coefficients (or constants) between the two wage equations. Being inexplicable, this component can be attributed to discrimination. However, Blinder also argued that the explained component of the wage gap also contains a portion that is due to discrimination. To examine this Blinder decomposed the explained component into:
1. the differences in endowments between the two groups, "as evaluated by the high-wage group's wage equation"; and
2. "the difference between how the high-wage equation would value the characteristics of the low-wage group, and how the low-wage equation actually values them".
Blinder called the first part the amount "attributable to the endowments" and the second part the amount "attributable to the coefficients", and he argued that the second part should also be viewed as reflecting discrimination:
"[this] only exists because the market evaluates differently the identical bundle of traits if possessed by members of different demographic groups, [and] is a reflection of discrimination as much as the shift coefficient is."
decomp closely follows Blinder's exposition and uses both his method and his terminology. decomp takes the average endowment differences between the two groups and weights them (multiplies them) by the high-wage workers' estimated coefficients. The differences in the estimated coefficients are weighted by (multiplied by) the average characteristics of the low-wage workers.
Conventionally, the high-wage group's wage structure is regarded as the "non-discriminatory norm", that is, the reference group. With the reverse option (r) switched on, the low-wage group becomes the reference group. The average endowment differences are now weighted by the low-wage workers' estimated coefficients, and the coefficient differences are weighted by the mean characteristics of the high-wage workers.
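For illustration, a reverse decomposition follows the same sequence of steps as the standard one, with only the final command changed. The sketch below borrows the variable names from the nlswork example in the Examples section. No output is shown, because that example was not run with the r option.

```stata
* high-wage group first, then low-wage group, exactly as for a standard run
regress ln_wage age tenure collgrad if race==1
himod
regress ln_wage age tenure collgrad if race==2
lomod

* r makes the low-wage group the reference group
decomp, r
```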
The results from decomp are presented using Blinder's (1973) original formulation of E, C, U and D.
The endowments (E) component of the decomposition is the sum of (the coefficient vector of the regressors of the high-wage group) times (the difference in group means between the high-wage and low-wage groups for the vector of regressors).
The coefficients (C) component of the decomposition is the sum of the (group means of the low-wage group for the vector of regressors) times (the difference between the regression coefficients of the high-wage group and the low-wage group).
The unexplained portion of the differential (U) is the difference in constants between the high-wage group and the low-wage group.
The portion of the differential due to discrimination is C + U.
The raw (or total) differential is E + C + U.
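Written out as equations, the definitions above can be summarised as follows. The notation is an assumption introduced here and does not appear in decomp's output: superscripts H and L denote the high-wage and low-wage groups, \hat\beta_j is the estimated coefficient on regressor j, \hat\beta_0 is the constant, and \bar{x}_j is the group mean of regressor j.

```latex
\begin{align*}
E &= \sum_{j} \hat\beta_j^{H}\,\bigl(\bar{x}_j^{H} - \bar{x}_j^{L}\bigr) \\
C &= \sum_{j} \bar{x}_j^{L}\,\bigl(\hat\beta_j^{H} - \hat\beta_j^{L}\bigr) \\
U &= \hat\beta_0^{H} - \hat\beta_0^{L} \\
D &= C + U \\
R &= E + C + U
   = \Bigl(\hat\beta_0^{H} + \sum_{j} \hat\beta_j^{H}\bar{x}_j^{H}\Bigr)
   - \Bigl(\hat\beta_0^{L} + \sum_{j} \hat\beta_j^{L}\bar{x}_j^{L}\Bigr)
\end{align*}
```

The last line follows because the cross terms in E and C cancel. It shows why R equals the gap between the two "Prediction (ln)" figures reported by himod and lomod.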
Examples
-------------------------------------------------------------------------------
Using regress in a wage equation where the high-wage and low-wage groups are defined by race:
. use http://www.stata-press.com/data/r8/nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
. keep if year==88
(26262 observations deleted)
. reg ln_wage age tenure collgrad if race==1
      Source |       SS       df       MS              Number of obs =    1636
-------------+------------------------------           F(  3,  1632) =   90.03
       Model |  81.4751215     3  27.1583738           Prob > F      =  0.0000
    Residual |  492.287598  1632  .301646812           R-squared     =  0.1420
-------------+------------------------------           Adj R-squared =  0.1404
       Total |  573.762719  1635  .350925211           Root MSE      =  .54922
------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0071553   .0044197    -1.62   0.106    -.0158241    .0015136
      tenure |   .0292267   .0024998    11.69   0.000     .0243235    .0341298
    collgrad |   .3271635   .0311724    10.50   0.000     .2660213    .3883057
       _cons |   1.953557   .1737702    11.24   0.000     1.612721    2.294393
------------------------------------------------------------------------------
. himod, ds
Coefficients, means & predictions for high model
------------------------------------------------------
    Variable | Coefficent       Mean   Prediction
-------------+----------------------------------------
         age |     -0.007     39.263       -0.281
      tenure |      0.029      5.802        0.170
    collgrad |      0.327      0.257        0.084
       _cons |      1.954      1.000        1.954
------------------------------------------------------
Prediction (ln):     1.926
Prediction ($):       6.86
. reg ln_wage age tenure collgrad if race==2
      Source |       SS       df       MS              Number of obs =     580
-------------+------------------------------           F(  3,   576) =   59.86
       Model |  45.9587803     3  15.3195934           Prob > F      =  0.0000
    Residual |  147.408098   576  .255916836           R-squared     =  0.2377
-------------+------------------------------           Adj R-squared =  0.2337
       Total |  193.366878   579  .333966974           Root MSE      =  .50588
------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0091953    .007085    -1.30   0.195    -.0231109    .0047204
      tenure |   .0267151   .0037902     7.05   0.000     .0192708    .0341593
    collgrad |   .5721103   .0558089    10.25   0.000     .4624966     .681724
       _cons |   1.842348   .2754947     6.69   0.000     1.301252    2.383445
------------------------------------------------------------------------------
. lomod, ds
Coefficients, means & predictions for low model
------------------------------------------------------
    Variable | Coefficent       Mean   Prediction
-------------+----------------------------------------
         age |     -0.009     38.828       -0.357
      tenure |      0.027      6.490        0.173
    collgrad |      0.572      0.176        0.101
       _cons |      1.842      1.000        1.842
------------------------------------------------------
Prediction (ln):     1.759
Prediction ($):       5.81
. decomp
Decomposition results for variables (as %s)
------------------------------------------------------
    Variable |     Attrib      Endow      Coeff
-------------+----------------------------------------
         age |        7.6       -0.3        7.9
      tenure |       -0.4       -2.0        1.6
    collgrad |       -1.6        2.7       -4.3
-------------+----------------------------------------
    Subtotal |        5.6        0.3        5.2
------------------------------------------------------
Summary of decomposition results (as %)
-------------------------------------------
Amount attributable:             |     5.6
- due to endowments (E):         |     0.3
- due to coefficients (C):       |     5.2
Shift coefficient (U):           |    11.1
Raw differential (R) {E+C+U}:    |    16.7
Adjusted differential (D) {C+U}: |    16.4
---------------------------------+---------
Endowments as % total (E/R):     |     2.0
Discrimination as % total (D/R): |    98.0
-------------------------------------------
U = unexplained portion of differential
    (difference between model constants)
D = portion due to discrimination (C+U)
positive number indicates advantage to high group
negative number indicates advantage to low group
Interpreting the results:
By comparing the output from the two regression equations it is clear that white workers have higher constants and this is reflected in the 11.1% advantage in U (the shift coefficient). White workers also have higher returns to age and tenure, but not to college graduation. Nevertheless, the size of the age coefficient is such as to offset this last factor, leaving white workers with a net advantage in C of 5.2%. There is little difference in endowments between the two groups, something evident from a comparison of the himod and lomod output, which shows that there is little difference (apart from college graduation) between the average group characteristics of white and black workers. This lack of group differences is reflected in the small figure for E, just 0.3%.
Consequently, there is little difference between the raw differential (16.7%) and the adjusted differential (16.4%) because the difference in endowments between white and black workers is so small. In other words, almost all of the difference (98%) is due to discrimination, and this is made up of the difference in the shift coefficient (U) and differences in how the endowments are rewarded (C).
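The figures in the summary table can be checked by hand from the regress, himod and lomod output above. The following is a rough sketch rather than decomp's own code: it applies the definitions of E, C and U from the Method section, and the scalar names are chosen here for illustration. Because the group means are copied from the himod and lomod tables, where they are rounded to three decimals, the results are approximate.

```stata
* Coefficients copied from the two regress tables, means from himod/lomod

* E: high-group coefficient times (high-group mean minus low-group mean)
scalar E_age  = -.0071553 * (39.263 - 38.828)
scalar E_ten  =  .0292267 * (5.802 - 6.490)
scalar E_coll =  .3271635 * (.257 - .176)
scalar E = E_age + E_ten + E_coll

* C: low-group mean times (high-group coefficient minus low-group coefficient)
scalar C_age  = 38.828 * (-.0071553 - (-.0091953))
scalar C_ten  = 6.490  * (.0292267 - .0267151)
scalar C_coll = .176   * (.3271635 - .5721103)
scalar C = C_age + C_ten + C_coll

* U: difference between the two constants
scalar U = 1.953557 - 1.842348

* Report as percentages, as decomp does
display "E = " %5.1f 100*E "  C = " %5.1f 100*C "  U = " %5.1f 100*U
display "R = " %5.1f 100*(E+C+U) "  D = " %5.1f 100*(C+U)
```

Under these assumptions the displayed values should agree, to one decimal place, with the E, C, U, R and D rows of the summary table (0.3, 5.2, 11.1, 16.7 and 16.4).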
-------------------------------------------------------------------------------
Using heckman in a wage equation where the high-wage and low-wage groups are defined by county. Note the absence of the earlier heck option.
. use http://www.stata-press.com/data/r8/womenwk (657 missing values generated)
. heckman lnwage educ age, select(married children educ age), if county==9
note: married dropped due to collinearity
Iteration 0:   log likelihood = -74.063916
Iteration 1:   log likelihood = -74.036062
Iteration 2:   log likelihood = -74.036026
Iteration 3:   log likelihood = -74.036026
Heckman selection model                         Number of obs      =       200
(regression model with sample selection)        Censored obs       =        36
                                                Uncensored obs     =       164
                                                Wald chi2(2)       =     28.14
Log likelihood = -74.03603                      Prob > chi2        =    0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
   education |   .0351091   .0074637     4.70   0.000     .0204805    .0497376
         age |   .0115728   .0039782     2.91   0.004     .0037757      .01937
       _cons |   2.159828   .2213499     9.76   0.000      1.72599    2.593666
-------------+----------------------------------------------------------------
select       |
    children |   .5907552    .119561     4.94   0.000       .35642    .8250904
   education |   .0475423   .0426328     1.12   0.265    -.0360165    .1311011
         age |   .0842936   .0297379     2.83   0.005     .0260084    .1425788
       _cons |  -4.228175   1.466693    -2.88   0.004    -7.102841    -1.35351
-------------+----------------------------------------------------------------
     /athrho |   .3280496   .2852638     1.15   0.250    -.2310572    .8871564
    /lnsigma |  -1.383954   .0590332   -23.44   0.000    -1.499657   -1.268251
-------------+----------------------------------------------------------------
         rho |   .3167672   .2566401                     -.2270313    .7099864
       sigma |   .2505858   .0147929                      .2232067    .2813233
      lambda |   .0793774   .0661307                     -.0502364    .2089911
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.03   Prob > chi2 = 0.3097
------------------------------------------------------------------------------
. himod, ds
Coefficients, means & predictions for high model
------------------------------------------------------
    Variable | Coefficent       Mean   Prediction
-------------+----------------------------------------
   education |      0.035     14.820        0.520
         age |      0.012     43.620        0.505
       _cons |      2.160      1.000        2.160
------------------------------------------------------
Prediction (ln):     3.185
Prediction ($):      24.17
. heckman lnwage educ age, select(married children educ age), if county==1
Iteration 0:   log likelihood = -105.65156
Iteration 1:   log likelihood = -105.44248
Iteration 2:   log likelihood =  -105.4423
Iteration 3:   log likelihood =  -105.4423
Heckman selection model                         Number of obs      =       200
(regression model with sample selection)        Censored obs       =        87
                                                Uncensored obs     =       113
                                                Wald chi2(2)       =     27.98
Log likelihood = -105.4423                      Prob > chi2        =    0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnwage       |
   education |   .0404733   .0085642     4.73   0.000     .0236878    .0572588
         age |   .0077226   .0026888     2.87   0.004     .0024527    .0129925
       _cons |   2.231897   .1482204    15.06   0.000      1.94139    2.522403
-------------+----------------------------------------------------------------
select       |
     married |   .9627806   .2389799     4.03   0.000     .4943886    1.431173
    children |   .6902933   .0953078     7.24   0.000     .5034935    .8770932
   education |   .0983743   .0361862     2.72   0.007     .0274507     .169298
         age |   .0320238   .0118514     2.70   0.007     .0087954    .0552522
       _cons |  -3.221248   .6438905    -5.00   0.000     -4.48325   -1.959246
-------------+----------------------------------------------------------------
     /athrho |   .6845914   .2330463     2.94   0.003      .227829    1.141354
    /lnsigma |  -1.303502   .0810706   -16.08   0.000    -1.462398   -1.144607
-------------+----------------------------------------------------------------
         rho |   .5944962   .1506818                      .2239672    .8148694
       sigma |    .271579   .0220171                      .2316801    .3183491
      lambda |   .1614527   .0497236                      .0639962    .2589092
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     7.33   Prob > chi2 = 0.0068
------------------------------------------------------------------------------
. lomod, ds
Coefficients, means & predictions for low model
------------------------------------------------------
    Variable | Coefficent       Mean   Prediction
-------------+----------------------------------------
   education |      0.040     11.480        0.465
         age |      0.008     30.865        0.238
       _cons |      2.232      1.000        2.232
------------------------------------------------------
Prediction (ln):     2.935
Prediction ($):      18.82
. decomp
Decomposition results for variables (as %s)
------------------------------------------------------
    Variable |     Attrib      Endow      Coeff
-------------+----------------------------------------
   education |        5.6       11.7       -6.2
         age |       26.6       14.8       11.9
-------------+----------------------------------------
    Subtotal |       32.2       26.5        5.7
------------------------------------------------------
Summary of decomposition results (as %)
-------------------------------------------
Amount attributable:             |    32.2
- due to endowments (E):         |    26.5
- due to coefficients (C):       |     5.7
Shift coefficient (U):           |    -7.2
Raw differential (R) {E+C+U}:    |    25.0
Adjusted differential (D) {C+U}: |    -1.5
---------------------------------+---------
Endowments as % total (E/R):     |   105.9
Discrimination as % total (D/R): |    -5.9
-------------------------------------------
U = unexplained portion of differential
    (difference between model constants)
D = portion due to discrimination (C+U)
positive number indicates advantage to high group
negative number indicates advantage to low group
References
Alan S. Blinder (1973) 'Wage Discrimination: Reduced Form and Structural Estimates', Journal of Human Resources, 8:4, Fall, 436-455.
Ronald Oaxaca (1973) 'Male-Female Wage Differentials in Urban Labor Markets', International Economic Review, 14:3, October, 693-709.
Note on versions
Version 1.7 of decomp has been written for Stata Release 8.2. It differs from Version 1.6 in only one respect. It fixes a bug whereby, when using selection models, decomp was using the full sample rather than the wage sample (i.e. the outcome sample). This has now been corrected. Two temporary variables are used for this: __fullsample and __wagesample. These are unlikely to already exist in the user's dataset and they are removed when himod and lomod conclude. Thanks to Anne Busch for drawing this to my attention.
Author
Ian Watson
Freelance researcher and
Visiting Senior Research Fellow
Macquarie University
Sydney Australia
mail@ianwatson.com.au
www.ianwatson.com.au