{smcl}
{* 8nov2010}
{hline}
help for {hi:decomp}
{hline}

{title:Decomposition of wage gaps}

{p}Syntax involves a sequence of steps:

{p 8 14}{cmd:regress} {it:varlist} [{it:weight}] {cmd:if} {it:exp} (where {it:exp} selects the high-wage group, for example, race==1)

{p 8}{cmd:himod} [{it:weight}] [,{cmd:ds}]

{p 8 14}{cmd:regress} {it:varlist} [{it:weight}] {cmd:if} {it:exp} (where {it:exp} selects the low-wage group, for example, race==2)

{p 8}{cmd:lomod} [{it:weight}] [,{cmd:ds}]

{p 8}{cmd:decomp} [,{cmd:r}]

{p 8 14}{cmd:aweight}s and {cmd:fweight}s are allowed; see {help weights}.

{title:Description}

{p 5 5}{cmd:decomp} computes Blinder-Oaxaca wage decompositions. It compares the results from two regressions, using intermediate commands ({cmd:himod} and {cmd:lomod}), and produces a table of output containing the decompositions. These decompositions show how much of the wage gap is due to differing endowments between the two groups, and how much is due to discrimination (regarded as the portion of the wage gap due to the combined effect of the slope coefficients and the intercepts for the two groups).

{p 5 5}{cmd:decomp} is designed for Stata's {help regress} command, but also works with other regression commands, such as {help ivreg} and {help tobit}. The previous version required a {cmd:heck} option if {cmd:decomp} was used with Stata's {help heckman} command. This is no longer necessary: {cmd:decomp} now recognises when the regression is a Heckman-type model and takes account of this. The same applies to tobit regression, which {cmd:decomp} also recognises automatically. As a result, the only option which may be specified with {cmd:himod} or {cmd:lomod} is {cmd:ds}. Existing user syntax containing the {cmd:heck} option should be edited to remove this term.

{p 5 5}See {net "describe http://fmwww.bc.edu/RePEc/bocode/o/oaxaca":oaxaca} by Ben Jann for a package which is far more comprehensive and up-to-date than {cmd:decomp}.
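{p 5 5}As a minimal sketch, the sequence of steps in the syntax diagram above might look as follows when weights are used. The variable {cmd:wt} is a hypothetical weight variable introduced only for illustration (it is not part of the nlswork data used in the examples), and the same weight is repeated in {cmd:himod} and {cmd:lomod} as the Options section requires.

```stata
* Sketch only: wt is a hypothetical analytic weight, not a variable in nlswork
* Step 1: high-wage group regression, then store its results
regress ln_wage age tenure collgrad [aweight=wt] if race==1
himod [aweight=wt], ds

* Step 2: low-wage group regression, then store its results
regress ln_wage age tenure collgrad [aweight=wt] if race==2
lomod [aweight=wt], ds

* Step 3: decompose; no weights are given here
* (use "decomp, r" instead to take the low-wage group as the reference)
decomp
```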
{title:Options}

{p 5 5}The option for {cmd:himod} and {cmd:lomod} is {cmd:ds} (details). This provides a table of coefficients, means and predictions for each of the regressions. These are the data used by {cmd:decomp} to conduct the decomposition.

{p 5 5}The option for {cmd:decomp} is {cmd:r} (reverse), which computes the decomposition with the low-wage group as the reference point. See below for more details.

{p 5 5}To make use of weighting, weights (either {cmd:aweight}s or {cmd:fweight}s) must be applied in the regression commands, and then repeated in the {cmd:himod} and {cmd:lomod} routines. No weights should be specified when {cmd:decomp} itself is run.

{title:Method}

{p 5 5}In essence, the Blinder-Oaxaca decomposition breaks down the wage gap between high-wage and low-wage workers into several components. The unexplained component is the difference in the shift coefficients (or constants) between the two wage equations. Because it cannot be explained by the regressors, this component can be attributed to discrimination. However, Blinder argued that the explained component of the wage gap also contains a portion that is due to discrimination. To examine this, Blinder decomposed the explained component into:

{p 10 13 10}1. the differences in endowments between the two groups, {it:"as evaluated} {it:by the high-wage group's wage equation"}; and

{p 10 13 10}2. "the difference between how the high-wage equation {it:would value} the characteristics of the low-wage group, and how the low-wage equation {it:actually values} them".

{p 5 5}Blinder called the first part the amount "attributable to the endowments" and the second part the amount "attributable to the coefficients", and he argued that the second part should also be viewed as reflecting discrimination:

{p 10 10 10}"[this] only exists because the market evaluates differently the identical bundle of traits if possessed by members of different demographic groups, [and] is a reflection of discrimination as much as the shift coefficient is."
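{p 5 5}To make the components described above concrete, the following sketch recomputes them by hand from the coefficients and means printed in the race example under Examples. It assumes the conventional case in which the high-wage group is the reference. The matrix names (bh, bl, xh, xl) are illustrative only and are not objects created by {cmd:decomp}.

```stata
* Values copied from the regress, himod and lomod output in the race example.
* Column order: age, tenure, collgrad. All names here are illustrative.
matrix bh = (-.0071553, .0292267, .3271635)   // high-wage coefficients
matrix bl = (-.0091953, .0267151, .5721103)   // low-wage coefficients
matrix xh = (39.263, 5.802, .257)             // high-wage means
matrix xl = (38.828, 6.490, .176)             // low-wage means

* Endowments: mean differences valued at the high-wage coefficients
matrix E = bh * (xh - xl)'
* Coefficients: coefficient differences valued at the low-wage means
matrix C = xl * (bh - bl)'
* Shift coefficient: difference between the two constants
scalar U = 1.953557 - 1.842348

* Express each component as a percentage, as decomp does
display "E (endowments):   " %5.1f 100*E[1,1]
display "C (coefficients): " %5.1f 100*C[1,1]
display "U (shift):        " %5.1f 100*U
display "R = E + C + U:    " %5.1f 100*(E[1,1] + C[1,1] + U)
display "D = C + U:        " %5.1f 100*(C[1,1] + U)
```

{p 5 5}Because the means are entered as printed (three decimal places), the displayed values should agree with the E, C, U, R and D figures in the {cmd:decomp} summary table for that example only up to rounding.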
{p 5 5}{cmd:decomp} closely follows Blinder's exposition and uses both his method and his terminology. {cmd:decomp} takes the average endowment differences between the two groups and weights them (multiplies them) by the high-wage workers' estimated coefficients. The differences in the estimated coefficients are weighted by (multiplied by) the average characteristics of the low-wage workers.

{p 5 5}Conventionally, the high-wage group's wage structure is regarded as the "non-discriminatory norm", that is, the reference group. With the reverse option ({cmd:r}) switched on, the low-wage group becomes the reference group. The average endowment differences are then weighted by the low-wage workers' estimated coefficients, and the coefficient differences are weighted by the mean characteristics of the high-wage workers.

{p 5 5}The results from {cmd:decomp} are presented using Blinder's (1973) original formulation of E, C, U and D.

{p 5 5}The endowments (E) component of the decomposition is the sum of (the coefficient vector of the regressors of the high-wage group) times (the difference in group means between the high-wage and low-wage groups for the vector of regressors).

{p 5 5}The coefficients (C) component of the decomposition is the sum of (the group means of the low-wage group for the vector of regressors) times (the difference between the regression coefficients of the high-wage group and the low-wage group).

{p 5 5}The unexplained portion of the differential (U) is the difference in constants between the high-wage group and the low-wage group.

{p 5 5}The portion of the differential due to discrimination (D) is C + U.

{p 5 5}The raw (or total) differential (R) is E + C + U.

{title:Examples}

{hline}

{p}Using {help regress} in a wage equation where the high-wage and low-wage groups are defined by race:

{cmd:. use http://www.stata-press.com/data/r8/nlswork}
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)

{cmd:. 
keep if year==88} (26262 observations deleted) {cmd:. reg ln_wage age tenure collgrad if race==1} Source | SS df MS Number of obs = 1636 -------------+------------------------------ F( 3, 1632) = 90.03 Model | 81.4751215 3 27.1583738 Prob > F = 0.0000 Residual | 492.287598 1632 .301646812 R-squared = 0.1420 -------------+------------------------------ Adj R-squared = 0.1404 Total | 573.762719 1635 .350925211 Root MSE = .54922 ------------------------------------------------------------------------------ ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | -.0071553 .0044197 -1.62 0.106 -.0158241 .0015136 tenure | .0292267 .0024998 11.69 0.000 .0243235 .0341298 collgrad | .3271635 .0311724 10.50 0.000 .2660213 .3883057 _cons | 1.953557 .1737702 11.24 0.000 1.612721 2.294393 ------------------------------------------------------------------------------ {cmd: himod, ds} Coefficients, means & predictions for high model ------------------------------------------------------ Variable | Coefficent Mean Prediction -------------+---------------------------------------- age | -0.007 39.263 -0.281 tenure | 0.029 5.802 0.170 collgrad | 0.327 0.257 0.084 _cons | 1.954 1.000 1.954 ------------------------------------------------------ Prediction (ln): 1.926 Prediction ($): 6.86 {cmd:. reg ln_wage age tenure collgrad if race==2} Source | SS df MS Number of obs = 580 -------------+------------------------------ F( 3, 576) = 59.86 Model | 45.9587803 3 15.3195934 Prob > F = 0.0000 Residual | 147.408098 576 .255916836 R-squared = 0.2377 -------------+------------------------------ Adj R-squared = 0.2337 Total | 193.366878 579 .333966974 Root MSE = .50588 ------------------------------------------------------------------------------ ln_wage | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------- age | -.0091953 .007085 -1.30 0.195 -.0231109 .0047204 tenure | .0267151 .0037902 7.05 0.000 .0192708 .0341593 collgrad | .5721103 .0558089 10.25 0.000 .4624966 .681724 _cons | 1.842348 .2754947 6.69 0.000 1.301252 2.383445 ------------------------------------------------------------------------------ {cmd:. lomod, ds} Coefficients, means & predictions for low model ------------------------------------------------------ Variable | Coefficent Mean Prediction -------------+---------------------------------------- age | -0.009 38.828 -0.357 tenure | 0.027 6.490 0.173 collgrad | 0.572 0.176 0.101 _cons | 1.842 1.000 1.842 ------------------------------------------------------ Prediction (ln): 1.759 Prediction ($): 5.81 {cmd:. decomp} Decomposition results for variables (as %s) ------------------------------------------------------ Variable | Attrib Endow Coeff -------------+---------------------------------------- age | 7.6 -0.3 7.9 tenure | -0.4 -2.0 1.6 collgrad | -1.6 2.7 -4.3 -------------+---------------------------------------- Subtotal | 5.6 0.3 5.2 ------------------------------------------------------ Summary of decomposition results (as %) ------------------------------------------- Amount attributable: | 5.6 - due to endowments (E): | 0.3 - due to coefficients (C): | 5.2 Shift coefficient (U): | 11.1 Raw differential (R) {E+C+U}: | 16.7 Adjusted differential (D) {C+U}: | 16.4 ---------------------------------+--------- Endowments as % total (E/R): | 2.0 Discrimination as % total (D/R): | 98.0 ------------------------------------------- U = unexplained portion of differential (difference between model constants) D = portion due to discrimination (C+U) positive number indicates advantage to high group negative number indicates advantage to low group {p 5 5 5}{it:Interpreting the results:} {p 5 5 5}By comparing the output from the two regression equations it is clear that 
white workers have higher constants and this is reflected in the 11.1% advantage in U (the shift coefficient). White workers also have higher returns to age and tenure, but not to college graduation. Nevertheless, the size of the age coefficient is such as to offset this last factor, leaving white workers with a net advantage in C of 5.2%. There is little difference in endowments between the two groups, something evident from a comparison of the {cmd:himod} and {cmd:lomod} output, which shows that there is little difference (apart from college graduation) between the average group characteristics of white and black workers. This lack of group differences is reflected in the small figure for E, just 0.3%. {p 5 5 5}Consequently, there is little difference between the raw differential (16.7%) and the adjusted differential (16.4%) because the difference in endowments between white and black workers is so small. In other words, almost all of the difference (98%) is due to discrimination, and this is made up of the difference in the shift coefficient (U) and differences in how the endowments are rewarded (C). {hline} {p} Using {help heckman} in a wage equation where high wage and low wage is based on county. Note the absence of the earlier {cmd:heck} option. {cmd:. use http://www.stata-press.com/data/r8/womenwk} (657 missing values generated) {cmd:. heckman lnwage educ age, select(married children educ age), if county==9} note: married dropped due to collinearity Iteration 0: log likelihood = -74.063916 Iteration 1: log likelihood = -74.036062 Iteration 2: log likelihood = -74.036026 Iteration 3: log likelihood = -74.036026 Heckman selection model Number of obs = 200 (regression model with sample selection) Censored obs = 36 Uncensored obs = 164 Wald chi2(2) = 28.14 Log likelihood = -74.03603 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. 
Interval] -------------+---------------------------------------------------------------- lnwage | education | .0351091 .0074637 4.70 0.000 .0204805 .0497376 age | .0115728 .0039782 2.91 0.004 .0037757 .01937 _cons | 2.159828 .2213499 9.76 0.000 1.72599 2.593666 -------------+---------------------------------------------------------------- select | children | .5907552 .119561 4.94 0.000 .35642 .8250904 education | .0475423 .0426328 1.12 0.265 -.0360165 .1311011 age | .0842936 .0297379 2.83 0.005 .0260084 .1425788 _cons | -4.228175 1.466693 -2.88 0.004 -7.102841 -1.35351 -------------+---------------------------------------------------------------- /athrho | .3280496 .2852638 1.15 0.250 -.2310572 .8871564 /lnsigma | -1.383954 .0590332 -23.44 0.000 -1.499657 -1.268251 -------------+---------------------------------------------------------------- rho | .3167672 .2566401 -.2270313 .7099864 sigma | .2505858 .0147929 .2232067 .2813233 lambda | .0793774 .0661307 -.0502364 .2089911 ------------------------------------------------------------------------------ LR test of indep. eqns. (rho = 0): chi2(1) = 1.03 Prob > chi2 = 0.3097 ------------------------------------------------------------------------------ {cmd:. himod, ds} Coefficients, means & predictions for high model ------------------------------------------------------ Variable | Coefficent Mean Prediction -------------+---------------------------------------- education | 0.035 14.820 0.520 age | 0.012 43.620 0.505 _cons | 2.160 1.000 2.160 ------------------------------------------------------ Prediction (ln): 3.185 Prediction ($): 24.17 {cmd:. 
heckman lnwage educ age, select(married children educ age), if county==1} Iteration 0: log likelihood = -105.65156 Iteration 1: log likelihood = -105.44248 Iteration 2: log likelihood = -105.4423 Iteration 3: log likelihood = -105.4423 Heckman selection model Number of obs = 200 (regression model with sample selection) Censored obs = 87 Uncensored obs = 113 Wald chi2(2) = 27.98 Log likelihood = -105.4423 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- lnwage | education | .0404733 .0085642 4.73 0.000 .0236878 .0572588 age | .0077226 .0026888 2.87 0.004 .0024527 .0129925 _cons | 2.231897 .1482204 15.06 0.000 1.94139 2.522403 -------------+---------------------------------------------------------------- select | married | .9627806 .2389799 4.03 0.000 .4943886 1.431173 children | .6902933 .0953078 7.24 0.000 .5034935 .8770932 education | .0983743 .0361862 2.72 0.007 .0274507 .169298 age | .0320238 .0118514 2.70 0.007 .0087954 .0552522 _cons | -3.221248 .6438905 -5.00 0.000 -4.48325 -1.959246 -------------+---------------------------------------------------------------- /athrho | .6845914 .2330463 2.94 0.003 .227829 1.141354 /lnsigma | -1.303502 .0810706 -16.08 0.000 -1.462398 -1.144607 -------------+---------------------------------------------------------------- rho | .5944962 .1506818 .2239672 .8148694 sigma | .271579 .0220171 .2316801 .3183491 lambda | .1614527 .0497236 .0639962 .2589092 ------------------------------------------------------------------------------ LR test of indep. eqns. (rho = 0): chi2(1) = 7.33 Prob > chi2 = 0.0068 ------------------------------------------------------------------------------ {cmd:. 
lomod, ds} Coefficients, means & predictions for low model ------------------------------------------------------ Variable | Coefficent Mean Prediction -------------+---------------------------------------- education | 0.040 11.480 0.465 age | 0.008 30.865 0.238 _cons | 2.232 1.000 2.232 ------------------------------------------------------ Prediction (ln): 2.935 Prediction ($): 18.82 {cmd:. decomp} Decomposition results for variables (as %s) ------------------------------------------------------ Variable | Attrib Endow Coeff -------------+---------------------------------------- education | 5.6 11.7 -6.2 age | 26.6 14.8 11.9 -------------+---------------------------------------- Subtotal | 32.2 26.5 5.7 ------------------------------------------------------ Summary of decomposition results (as %) ------------------------------------------- Amount attributable: | 32.2 - due to endowments (E): | 26.5 - due to coefficients (C): | 5.7 Shift coefficient (U): | -7.2 Raw differential (R) {E+C+U}: | 25.0 Adjusted differential (D) {C+U}: | -1.5 ---------------------------------+--------- Endowments as % total (E/R): | 105.9 Discrimination as % total (D/R): | -5.9 ------------------------------------------- U = unexplained portion of differential (difference between model constants) D = portion due to discrimination (C+U) positive number indicates advantage to high group negative number indicates advantage to low group {title:References} {p 5 5} Alan S. Blinder (1973) 'Wage Discrimination: Reduced Form and Structural Estimates', Journal of Human Resources, 8:4, Fall, 436-455. {p 5 5} Ronald Oaxaca (1973) 'Male-Female Wage Differentials in Urban Labor Markets', International Economic Review, 14:3, October, 693-709. {title:Note on versions} {p 5 5} Version 1.7 of {cmd:decomp} has been written for Stata Release 8.2. It differs from Version 1.6 in only one respect. 
It fixes a bug whereby, when using selection models, {cmd:decomp} was using the full sample rather than the wage sample (i.e. the outcome sample). This has now been corrected. Two temporary variables are used for this: __fullsample and __wagesample. These are unlikely to already exist in the user's dataset, and they are removed when {cmd:himod} and {cmd:lomod} conclude. Thanks to Anne Busch for drawing this to my attention.

{title:Author}

Ian Watson
Freelance researcher and Visiting Senior Research Fellow
Macquarie University
Sydney Australia
mail@ianwatson.com.au
www.ianwatson.com.au