help for decompose

Decomposition of wage differentials

Standard syntax:

decompose varlist [weight] [if exp] [in range] , by(varname) [ detail estimates lambda(varname) noisy gpooled npooled regress_options ]

aweights, fweights, iweights, and pweights are allowed; see help weights.

Alternative syntax:

decompose , save(high | low | pooled )

decompose [ , detail estimates lambda(varname) ]


Given the results from two regressions (one for each of two groups), decompose computes several decompositions of the difference in the mean of the outcome variable between the groups. The decompositions show how much of the gap is due to differing endowments between the two groups, and how much is due to discrimination. Usually this is applied to wage differentials using Mincer-type earnings equations.

Standard syntax (varlist and by(varname) specified): Regression models will be estimated for each category of varname prior to the computation of the decomposition.

Alternative syntax: Results from stand-alone estimation commands may be saved using decompose, save(). The command decompose (without varlist, by or save) will capture these results and compute the decomposition.

See decomp by Ian Watson for a similar package.


Common options:

detail additionally displays the decomposition results separately for each explanatory variable.

estimates additionally displays a table of regression coefficients and means.

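lambda(varname) reduces each group's mean prediction by the effect of varname at its mean, i.e., by the group mean of varname times its coefficient (see methods and formulas below). This might be reasonable if varname is a selection variable.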

Standard syntax options:

by(varname) specifies the grouping variable (which may be numeric or string). The group with the highest mean on the dependent variable will be compared to each of the other groups.

noisy switches on regression output.

npooled deactivates the estimation of pooled regression models (which are required for the Neumark decomposition; see methods and formulas below).

gpooled requests the estimation of a single pooled model over all groups rather than casewise pooled models (note: if by(varname) specifies only two groups, this will have no effect).

regress_options control the regression estimation; see help regress.

Alternative syntax options:

save() saves the coefficients, means and the number of cases (or the sum of weights, respectively) of the preceding estimation. Use save(high) for the high group (i.e. the group with the higher mean on the dependent variable), save(low) for the low group, and save(pooled) for the pooled model over both groups. The right-hand-side varlists of the high and low models do not necessarily need to be identical (if, e.g., a selection term is included in one model; note that the consideration of a pooled model is not possible in this case).


Examples

Standard syntax:

. decompose lnwage educ exp exp2, by(female) detail estimates

. decompose lnwage educ exp exp2 lbda [pweight=1/prob], by(female) lambda(lbda)

Alternative syntax:

. regress lnwage educ exp exp2 [fweight=pop] if female==0
. decompose, save(high)
. regress lnwage educ exp exp2 [fweight=pop] if female==1
. decompose, save(low)
. regress lnwage educ exp exp2 [fweight=pop] if inlist(female,0,1)
. decompose, save(pooled)
. decompose

. regress lnwage educ exp exp2 if female==0
. decompose, save(high)
. regress lnwage educ exp exp2 lbda if female==1
. decompose, save(low)
. decompose, lambda(lbda) detail

Saved Results

r(fH)      proportion of obs. (or sum of wgts) in high group (scalar)
r(pred)    vector of mean predictions
r(decomp)  detailed decomposition matrix
r(xb)      matrix of coefficients and means

Methods and Formulas

Let y1 and y2 be the means of the dependent variable Y, x1 and x2 the row vectors of the means of the explanatory variables X1,...,Xk, and b1 and b2 the column vectors of the coefficients for group 1 (high) and group 2 (low). The raw differential y1-y2 may then be expressed as

R = y1-y2 = (x1-x2)b2 + x2(b1-b2) + (x1-x2)(b1-b2) = E + C + CE

(Winsborough/Dickenson 1971; Jones/Kelley 1984; Daymont/Andrisani 1984), i.e., R is decomposed into a part due to differences in endowments (E), a part due to differences in coefficients (including the intercept) (C), and a part due to interaction between coefficients and endowments (CE). Depending on the model which is assumed to be non-discriminating, these terms may be used to determine the "unexplained" (U; discrimination) and the "explained" (V) part of the differential (the question is how to allocate the interaction term CE). Oaxaca (1973) proposed to assume either the low group model or the high group model as non-discriminating, which leads to U=C+CE and V=E or U=C and V=E+CE, respectively. More generally the decomposition may be written as
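The identity R = E + C + CE and the two allocations proposed by Oaxaca (1973) can be checked numerically. The following is a minimal sketch in plain Python, not part of decompose itself; the group means and coefficients are made-up values chosen only for illustration, and the last element of each vector stands for the constant.

```python
# Hypothetical inputs (assumptions for illustration, not real estimates).
x1 = [12.0, 20.0, 1.0]   # high group: means of educ, exp, and the constant
x2 = [11.0, 15.0, 1.0]   # low group: means of educ, exp, and the constant
b1 = [0.10, 0.02, 1.00]  # high-group coefficients (intercept last)
b2 = [0.08, 0.01, 0.90]  # low-group coefficients (intercept last)


def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


# With a constant in the model, the mean of Y equals the mean prediction.
y1 = dot(x1, b1)
y2 = dot(x2, b2)

R = y1 - y2                          # raw differential
E = dot(sub(x1, x2), b2)             # endowments
C = dot(x2, sub(b1, b2))             # coefficients (incl. intercept)
CE = dot(sub(x1, x2), sub(b1, b2))   # interaction

# Low-group model assumed non-discriminating: U = C + CE, V = E.
U_low, V_low = C + CE, E
# High-group model assumed non-discriminating: U = C, V = E + CE.
U_high, V_high = C, E + CE

print("R =", round(R, 4), " E =", round(E, 4),
      " C =", round(C, 4), " CE =", round(CE, 4))
print("low reference:  U =", round(U_low, 4), " V =", round(V_low, 4))
print("high reference: U =", round(U_high, 4), " V =", round(V_high, 4))
```

Whichever allocation is chosen, U and V sum to R; only the assignment of the interaction term CE differs.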

y1-y2 = (x1-x2)[D*b1+(I-D)*b2] + [x1*(I-D)+x2*D](b1-b2)

where I is an identity matrix and D is a diagonal matrix of weights. In the two cases proposed by Oaxaca (1973), D is a null matrix or equals I, respectively (D=I is also what Blinder 1973 suggested). Reimers (1983) proposed using the mean of the coefficients from the low and the high model, i.e., the diagonal elements of D equal 0.5. Cotton (1988) proposed weighting the coefficients by group size, i.e., the diagonal elements of D equal fH, where fH is the proportion of subjects in the high group (or of the sum of weights, if weights are applied). Finally, Neumark (1988) proposed estimating a pooled model over both groups, which leads to D=diag(bP-b2)*diag(b1-b2)^-1 or

y1-y2 = (x1-x2)bP + [x1(b1-bP)+x2(bP-b2)]

where bP is the column vector of the coefficients in the pooled model.
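The role of the weighting matrix D can be illustrated with a small numeric sketch in plain Python. All inputs are hypothetical: the means, the coefficients, the group share fH = 0.6, and in particular the pooled coefficient vector bP, which in practice would come from a pooled regression rather than being chosen by hand. The helper name weighted_decomposition is likewise an illustrative assumption, not something decompose provides.

```python
# Hypothetical inputs (assumptions for illustration, not real estimates).
x1 = [12.0, 20.0, 1.0]     # high-group means (constant last)
x2 = [11.0, 15.0, 1.0]     # low-group means (constant last)
b1 = [0.10, 0.02, 1.00]    # high-group coefficients
b2 = [0.08, 0.01, 0.90]    # low-group coefficients
bP = [0.09, 0.016, 0.95]   # made-up "pooled" coefficients
fH = 0.6                   # made-up share of the high group


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def weighted_decomposition(d):
    """Return (U, V) for a diagonal weight matrix D with diagonal d."""
    b_ref = [di * a + (1 - di) * b for di, a, b in zip(d, b1, b2)]
    x_ref = [(1 - di) * a + di * b for di, a, b in zip(d, x1, x2)]
    V = dot(sub(x1, x2), b_ref)   # explained part
    U = dot(x_ref, sub(b1, b2))   # unexplained part
    return U, V


R = dot(x1, b1) - dot(x2, b2)
k = len(b1)

results = {
    "Oaxaca, D=0": weighted_decomposition([0.0] * k),
    "Oaxaca/Blinder, D=I": weighted_decomposition([1.0] * k),
    "Reimers, D=0.5*I": weighted_decomposition([0.5] * k),
    "Cotton, D=fH*I": weighted_decomposition([fH] * k),
}

# Neumark: written directly in terms of bP ...
V_neumark = dot(sub(x1, x2), bP)
U_neumark = dot(x1, sub(b1, bP)) + dot(x2, sub(bP, b2))
results["Neumark, pooled"] = (U_neumark, V_neumark)

# ... and, equivalently, via D = diag(bP-b2) * diag(b1-b2)^-1.
d_neumark = [(p - l) / (h - l) for p, h, l in zip(bP, b1, b2)]
U_via_D, V_via_D = weighted_decomposition(d_neumark)

for name, (U, V) in results.items():
    print(f"{name:22s} U = {U:.4f}  V = {V:.4f}  U+V = {U + V:.4f}")
```

In every case U + V reproduces the raw differential R; the choice of D only shifts how much of R is labeled explained or unexplained. Note that the D-based form of the Neumark weights requires b1 and b2 to differ in every element.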

decompose calculates and displays R, E, C, CE, as well as U and V according to the methods described. The coefficient vectors are taken from "e(b)" as returned by the estimation commands; the means of the explanatory variables and the group sizes are calculated for "e(sample)" using summarize (weighted if necessary).

Treatment of selection variables: Assume that a selection variable XS appears in both models. If it is not marked out by lambda(XS) it will be treated just as any other variable. If it is marked out, however, the group means of Y will be adjusted for selection, that is

yS1 = y1 - xS1*bS1
yS2 = y2 - xS2*bS2

where xS1 and xS2 are the group means of XS, and bS1 and bS2 the corresponding coefficients. The raw differential will then be

RS = yS1 - yS2 = y1 - y2 - (xS1*bS1 - xS2*bS2)
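As a quick arithmetic check of the selection adjustment, the sketch below uses made-up group means and selection-term coefficients (all values are assumptions for illustration) and confirms that both expressions for RS agree.

```python
# Hypothetical values (assumptions for illustration only).
y1, y2 = 2.60, 1.93       # group means of Y
xS1, xS2 = 0.30, 0.50     # group means of the selection variable XS
bS1, bS2 = -0.20, 0.10    # coefficients of XS in the two models

# Adjust each group's mean for selection.
yS1 = y1 - xS1 * bS1
yS2 = y2 - xS2 * bS2

# Adjusted raw differential, computed in both ways given in the text.
RS = yS1 - yS2
RS_alt = y1 - y2 - (xS1 * bS1 - xS2 * bS2)

print("yS1 =", round(yS1, 4), " yS2 =", round(yS2, 4))
print("RS  =", round(RS, 4), " (alternative form:", round(RS_alt, 4), ")")
```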

Now assume that the selection variable XS appears in only one model (which is possible with the alternative syntax). If XS is not marked out, its effect will be fully attributed to the explained part V in any case; this is accomplished by assuming xS=0 in the other model and bS1=bS2. (See Dolton/Makepeace 1986 for an alternative treatment, which has not yet been incorporated.) If it is marked out, the mean of the corresponding group will be adjusted for selection as described above.
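Why this convention puts the whole effect of an unmarked XS into V can be seen from a small sketch. The values are hypothetical, XS is assumed to appear only in the high-group model, and the loop tries several weights d (the diagonal element of D belonging to XS).

```python
# Hypothetical values; XS is assumed to appear only in the high-group model.
xS1, bS1 = 0.30, -0.20

# Convention from the text: mean 0 in the other model, equal coefficients.
xS2 = 0.0
bS2 = bS1

# Contribution of XS to the threefold terms.
E_S = (xS1 - xS2) * bS2
C_S = xS2 * (bS1 - bS2)
CE_S = (xS1 - xS2) * (bS1 - bS2)

# Contribution of XS to V and U for several weights d.
contributions = []
for d in (0.0, 0.5, 0.6, 1.0):
    V_S = (xS1 - xS2) * (d * bS1 + (1 - d) * bS2)
    U_S = (xS1 * (1 - d) + xS2 * d) * (bS1 - bS2)
    contributions.append((d, U_S, V_S))
    print(f"d = {d:.1f}: U_S = {U_S:.4f}, V_S = {V_S:.4f}")
```

Because the coefficient difference is zero by construction, the unexplained contribution vanishes for every weight, and the explained contribution always equals xS1*bS1.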


References

Blinder, A.S. (1973). Wage Discrimination: Reduced Form and Structural Estimates. The Journal of Human Resources 8: 436-455.
Cotton, J. (1988). On the Decomposition of Wage Differentials. The Review of Economics and Statistics 70: 236-243.
Daymont, T.N., Andrisani, P.J. (1984). Job Preferences, College Major, and the Gender Gap in Earnings. The Journal of Human Resources 19: 408-428.
Dolton, P.J., Makepeace, G.H. (1986). Sample Selection and Male-Female Earnings Differentials in the Graduate Labour Market. Oxford Economic Papers 38: 317-341.
Jones, F.L., Kelley, J. (1984). Decomposing Differences Between Groups. A Cautionary Note on Measuring Discrimination. Sociological Methods and Research 12: 323-343.
Neumark, D. (1988). Employers' Discriminatory Behavior and the Estimation of Wage Discrimination. The Journal of Human Resources 23: 279-295.
Oaxaca, R. (1973). Male-Female Wage Differentials in Urban Labor Markets. International Economic Review 14: 693-709.
Reimers, C.W. (1983). Labor Market Discrimination Against Hispanic and Black Men. The Review of Economics and Statistics 65: 570-579.
Winsborough, H.H., Dickenson, P. (1971). Components of Negro-White Income Differences. Proceedings of the American Statistical Association, Social Statistics Section: 6-8.


Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Also see

Manual:  [R] regress
On-line: help for regress