Trend decomposition of outcome differentials
smithwelch est11 est21 est12 est22 [, benchmark(1|2|est1bm est2bm) reference(1|2|estref1 estref2 [estrefbm]) detail[(dlist)] adjust(varlist) eform nonotes ]
where dlist is
name1 = varlist1 [ , name2 = varlist2 [, ... ] ]
Description
smithwelch computes decompositions of differences in mean outcome differentials. Smith and Welch (1989) used such decomposition techniques in their analysis of the change in the black-white wage differential over time. An alternative application would be the decomposition of country differences in the male-female wage gap. Also see Lee (2000) and Heckman et al. (2000).
est11, est21, est12, and est22 specify the previously fitted and stored regression estimates to be used with the decomposition (see help estimates store). The model estimated last may be indicated by a period (.), even if it has not yet been stored. est11 and est21 specify the group 1 estimates (e.g. male, black) and the group 2 estimates (e.g. female, white) for the first sample (e.g. time point 1, country A), est12 and est22 are the group estimates for the second sample (time point 2, country B). Note that the estimation samples (e(sample)) of the specified models determine the relevant observations for the decomposition. Group 1 and group 2 must not overlap.
See the jmpierce2 package (available from the SSC archive; type ssc describe jmpierce2) for an alternative approach for the decomposition of differences in differentials. See the oaxaca package (type ssc describe oaxaca) for a program to compute single differential decompositions.
Options
benchmark(1|2|est1bm est2bm) specifies (the estimates for) the "benchmark" sample. benchmark(1) signifies that sample 1 is the benchmark sample and est11 and est21 are the benchmark estimates. Analogously, est12 and est22 are used as the benchmark, if you specify benchmark(2). Alternatively, use benchmark(est1bm est2bm) to provide the estimates from another sample to be used as the benchmark (e.g. the pooled sample over all time points or countries). If benchmark() is omitted, an extended decomposition containing interaction terms for simultaneous changes in endowments and coefficients is computed. See the Methods and Formulas Section below.
reference(1|2|estref1 estref2 [estrefbm]) determines the reference coefficients within the samples to be used with the decomposition. reference(1) means that the coefficients from the first group (i.e. est11 and est12) are used; reference(2) uses the group 2 estimates (est21 and est22). Alternatively, specify reference(estref1 estref2 [estrefbm]) to provide other reference estimates (e.g. models based on the pooled samples over both groups). estrefbm is required only if benchmark(est1bm est2bm) is specified. If reference() is omitted, an extended decomposition containing interaction terms for the combined effect of differences in endowments and coefficients is computed. See the Methods and Formulas Section below.
detail[(dlist)] requests that detailed decomposition results for the individual regressors be reported. Use dlist to subsume the results for specific groups of regressors (variables not appearing in dlist are listed individually). The usual shorthand conventions apply to the varlists specified in dlist (see help varlist). For example, specify detail(exp=exp*) if the models contain exp (experience) and exp2 (experience squared). Note that individual results concerning the effect of changes/differences in coefficients may arbitrarily depend on the scaling of the regressors.
adjust(varlist) may be used to adjust the outcome differentials for the effects of certain variables (e.g. selection variables) before computing the decomposition.
eform causes the results to be displayed in exponentiated form.
nonotes suppresses the display of the legend.
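The grouping requested by detail(dlist) can be pictured with a small Python sketch. It assumes that a grouped result is simply the sum of the per-regressor contributions, and it imitates the varlist wildcard with fnmatch; the function name group_contributions and all numbers are hypothetical and not part of smithwelch.

```python
import math
from fnmatch import fnmatchcase

def group_contributions(contrib, dlist):
    """Sum per-regressor contributions into the named groups of dlist.

    Regressors matching no pattern are kept as individual entries.
    """
    grouped, assigned = {}, set()
    for name, patterns in dlist.items():
        members = [v for v in contrib
                   if any(fnmatchcase(v, p) for p in patterns)]
        grouped[name] = sum(contrib[v] for v in members)
        assigned.update(members)
    for v, value in contrib.items():
        if v not in assigned:
            grouped[v] = value
    return grouped

# Hypothetical per-regressor contributions to one decomposition component
contrib = {"educ": 0.07, "exp": 0.04, "exp2": -0.01, "_cons": 0.0}

# Analogue of detail(exp=exp*): exp and exp2 are reported together
result = group_contributions(contrib, {"exp": ["exp*"]})
print(result)
```

Regrouping only changes how the contributions are reported; their total is unaffected.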
Examples
. regress lnwage educ exp exp2 if sex==0 & year==1
. estimates store male1
. regress lnwage educ exp exp2 if sex==1 & year==1
. estimates store female1
. regress lnwage educ exp exp2 if sex==0 & year==2
. estimates store male2
. regress lnwage educ exp exp2 if sex==1 & year==2
. estimates store female2
. smithwelch male1 female1 male2 female2
. smithwelch male1 female1 male2 female2, benchmark(1) reference(1)
. generate byte year2 = year==2
. regress lnwage educ exp exp2 year2 if sex==0 & (year==1 | year==2)
. estimates store male12
. regress lnwage educ exp exp2 year2 if sex==1 & (year==1 | year==2)
. estimates store female12
. smithwelch male1 female1 male2 female2, benchmark(male12 female12)
. regress lnwage educ exp exp2 if year==1
. estimates store pooled1
. regress lnwage educ exp exp2 if year==2
. estimates store pooled2
. smithwelch male1 female1 male2 female2, reference(pooled1 pooled2)
Saved Results
Matrices:
r(D)               Decomposition of individual differentials
r(DD)              Decomposition of difference in differentials
r(b11) ... r(b22)  Parameter vectors
r(X11) ... r(X22)  Vectors of means of regressors
r(b1b), r(b2b)     Parameter vectors for benchmark sample (if provided)
r(br1), r(br2)     Reference parameter vectors (if provided)
r(brb)             Reference parameter vector for benchmark sample (if provided)
Methods and Formulas
Consider the linear model
Y_gt = X_gt b_gt + e_gt,  E(e_gt) = 0,  g = 1,2,  t = 1,2,
where Y_gt is a vector of outcomes (e.g. log hourly wages) for group g at time t, X_gt is the data matrix (the values of the regressors), b_gt is the coefficient vector, and e_gt is the vector of error terms. The group differential in mean outcome at time t can be decomposed as follows (also see help oaxaca, if installed):
dy_t = y_1t - y_2t = x_1t'b_1t - x_2t'b_2t
= (x_1t-x_2t)'b_2t + x_2t'(b_1t-b_2t) + (x_1t-x_2t)'(b_1t-b_2t)
= dx_t'b_2t + x_2t'db_t + dx_t'db_t
= E + C + EC
where y_gt and x_gt symbolize group means and the "d" prefix indicates group differences. Thus, the mean outcome differential is decomposed into a part that is due to group differences in characteristics or "endowments" (E), a part that is due to differences in coefficients (including the intercept) (C), and a correction term capturing the interaction effect of differences in endowments and coefficients (EC). The first term, E, measures the change in mean outcome for group 2 if, everything else equal, group 2 had the group 1 endowment levels. The second term, C, measures the change in mean outcome for group 2 if group 2 retained its own endowment levels, but had the group 1 coefficients. The last term, EC, quantifies the additional effect that is due to the combined differences in endowments and coefficients.
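As a numerical illustration of the threefold decomposition, the following Python sketch computes E, C, and EC for one time point and checks that they add up to the differential. The means and coefficients are made-up values for a constant, educ, and exp; the helper names dot and sub are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    """Elementwise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical regressor means (constant, educ, exp) and coefficients
x1 = [1.0, 12.0, 20.0]    # group 1 means, x_1t
x2 = [1.0, 11.0, 18.0]    # group 2 means, x_2t
b1 = [1.0, 0.08, 0.02]    # group 1 coefficients, b_1t
b2 = [0.9, 0.07, 0.015]   # group 2 coefficients, b_2t

dx = sub(x1, x2)          # dx_t
db = sub(b1, b2)          # db_t

E = dot(dx, b2)           # endowments effect, dx_t'b_2t
C = dot(x2, db)           # coefficients effect, x_2t'db_t
EC = dot(dx, db)          # interaction, dx_t'db_t
dy = dot(x1, b1) - dot(x2, b2)

print("E, C, EC:", round(E, 4), round(C, 4), round(EC, 4))
print("sum:", round(E + C + EC, 4), " dy_t:", round(dy, 4))
```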
Now suppose that we want to analyze the change in the outcome differential over time (or compare the outcome differentials for different countries). The change in the differential from t=1 to t=2 can be written as the sum of the changes in the decomposition components E, C, and EC:
dy_2 - dy_1 = [dx_2'b_22 - dx_1'b_21] + [x_22'db_2 - x_21'db_1]
+ [dx_2'db_2 - dx_1'db_1]
= dE + dC + dEC
Each of the three terms can again be divided into a part due to changes in the x's, a part due to changes in the b's, and an interaction effect accounting for the simultaneous change in the x's and b's:
dE  = (dx_2-dx_1)'b_21 + dx_1'(b_22-b_21) + (dx_2-dx_1)'(b_22-b_21)
dC  = (x_22-x_21)'db_1 + x_21'(db_2-db_1) + (x_22-x_21)'(db_2-db_1)
dEC = (dx_2-dx_1)'db_1 + dx_1'(db_2-db_1) + (dx_2-dx_1)'(db_2-db_1)
            (E)                (C)                   (EC)
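The nine terms above all follow one pattern: a difference u_2'v_2 - u_1'v_1 is split into a change in u, a change in v, and an interaction. A minimal Python sketch with made-up numbers (the helper three_way and the dictionary keys "gt" are hypothetical) checks that the nine terms sum to dy_2 - dy_1:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def three_way(u1, u2, v1, v2):
    """Split u2'v2 - u1'v1 into (change in u, change in v, interaction)."""
    du, dv = sub(u2, u1), sub(v2, v1)
    return dot(du, v1), dot(u1, dv), dot(du, dv)

# Hypothetical means x_gt and coefficients b_gt, keyed by "gt"
x = {"11": [1.0, 12.0, 20.0], "21": [1.0, 11.0, 18.0],
     "12": [1.0, 13.0, 21.0], "22": [1.0, 12.5, 20.0]}
b = {"11": [1.00, 0.080, 0.020], "21": [0.90, 0.070, 0.015],
     "12": [1.10, 0.085, 0.022], "22": [1.05, 0.080, 0.020]}

dx1, dx2 = sub(x["11"], x["21"]), sub(x["12"], x["22"])
db1, db2 = sub(b["11"], b["21"]), sub(b["12"], b["22"])

dE = three_way(dx1, dx2, b["21"], b["22"])   # rows of the table above
dC = three_way(x["21"], x["22"], db1, db2)
dEC = three_way(dx1, dx2, db1, db2)

dy1 = dot(x["11"], b["11"]) - dot(x["21"], b["21"])
dy2 = dot(x["12"], b["12"]) - dot(x["22"], b["22"])
total = sum(dE) + sum(dC) + sum(dEC)

for name, parts in (("dE", dE), ("dC", dC), ("dEC", dEC)):
    print(name, [round(p, 4) for p in parts])
print("sum of nine terms:", round(total, 4), " dy_2 - dy_1:", round(dy2 - dy1, 4))
```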
Specifying reference models for the group differentials
It is common practice to remove the interaction term in the decomposition of the group differentials by specifying "reference" coefficients to be used with the decomposition (for example, the pooled estimates over both groups). Let b_rt indicate the reference coefficients vector at time t. The decomposition of the outcome differential at time t can then be written as:
dy_t = dx_t'b_rt + [x_1t'(b_1t-b_rt) + x_2t'(b_rt-b_2t)]
= E + C
Accordingly, the difference in differentials may be expressed as
dy_2 - dy_1 = dE + dC
with
dE = (dx_2-dx_1)'b_r1 + dx_1'(b_r2-b_r1) + (dx_2-dx_1)'(b_r2-b_r1)
dC = [(x_12-x_11)'(b_11-b_r1) + (x_22-x_21)'(b_r1-b_21)]
+ [x_11'((b_12-b_r2)-(b_11-b_r1))
+ x_21'((b_r2-b_22)-(b_r1-b_21))]
+ [(x_12-x_11)'((b_12-b_r2)-(b_11-b_r1))
+ (x_22-x_21)'((b_r2-b_22)-(b_r1-b_21))]
Note that the equations simplify considerably if the reference coefficients are the coefficients from the first group or the second group. For example, if b_rt=b_1t:
dy_t = dx_t'b_1t + x_2t'(b_1t-b_2t)
dy_2 - dy_1 = dE + dC
dE = (dx_2-dx_1)'b_11 + dx_1'(b_12-b_11) + (dx_2-dx_1)'(b_12-b_11)
dC = (x_22-x_21)'db_1 + x_21'(db_2-db_1) + (x_22-x_21)'(db_2-db_1)
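The reference-coefficient formulas can be checked numerically in the same spirit. The sketch below uses made-up means and coefficients; br1 and br2 are hypothetical stand-ins for pooled estimates, and the helper two_way is not part of smithwelch. It verifies that dE + dC reproduces dy_2 - dy_1 and that C_t collapses to x_2t'db_t when b_rt = b_1t.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def two_way(x1, x2, b1, b2, br):
    """Return (E, C) for one time point, given reference coefficients br."""
    E = dot(sub(x1, x2), br)
    C = dot(x1, sub(b1, br)) + dot(x2, sub(br, b2))
    return E, C

# Hypothetical means and coefficients (constant, educ, exp)
x11, x21 = [1.0, 12.0, 20.0], [1.0, 11.0, 18.0]
x12, x22 = [1.0, 13.0, 21.0], [1.0, 12.5, 20.0]
b11, b21 = [1.00, 0.080, 0.020], [0.90, 0.070, 0.015]
b12, b22 = [1.10, 0.085, 0.022], [1.05, 0.080, 0.020]
# Hypothetical reference coefficients, standing in for pooled estimates
br1, br2 = [0.95, 0.075, 0.018], [1.08, 0.083, 0.021]

E1, C1 = two_way(x11, x21, b11, b21, br1)
E2, C2 = two_way(x12, x22, b12, b22, br2)
dE, dC = E2 - E1, C2 - C1

# Three-part split of dE: endowments, coefficients, interaction
dx1, dx2 = sub(x11, x21), sub(x12, x22)
ddx, dbr = sub(dx2, dx1), sub(br2, br1)
dE_parts = (dot(ddx, br1), dot(dx1, dbr), dot(ddx, dbr))

dy1 = dot(x11, b11) - dot(x21, b21)
dy2 = dot(x12, b12) - dot(x22, b22)

# Special case b_rt = b_1t: C_1 should equal x_21'(b_11 - b_21)
_, C1_group1 = two_way(x11, x21, b11, b21, b11)

print("dE + dC:", round(dE + dC, 4), " dy_2 - dy_1:", round(dy2 - dy1, 4))
print("dE parts:", [round(p, 4) for p in dE_parts])
```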
Specifying a benchmark sample
Similarly, the number of terms in the decomposition of the change in differentials can be reduced by specifying a "benchmark" sample. Let b_1b and b_2b be the coefficient vectors from the benchmark sample for group 1 and group 2, and let db_b = b_1b - b_2b. The decomposition of the difference in differentials then is:
dy_2 - dy_1 = dE + dC + dEC
dE = (dx_2-dx_1)'b_2b + [dx_2'(b_22-b_2b) + dx_1'(b_2b-b_21)]
dC = (x_22-x_21)'db_b + [x_22'(db_2-db_b) + x_21'(db_b-db_1)]
dEC = (dx_2-dx_1)'db_b + [dx_2'(db_2-db_b) + dx_1'(db_b-db_1)]
Again, the formulas simplify if one of the two time points is the benchmark. For example, if b_gb=b_g1:
dE = (dx_2-dx_1)'b_21 + dx_2'(b_22-b_21)
dC = (x_22-x_21)'db_1 + x_22'(db_2-db_1)
dEC = (dx_2-dx_1)'db_1 + dx_2'(db_2-db_1)
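A short Python sketch of the benchmark decomposition, again with made-up numbers: the benchmark vectors b1b and b2b and the helper with_benchmark are hypothetical. Each component is split into a part due to changes in the x's (valued at the benchmark coefficients) and a part due to changes in the b's.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def with_benchmark(u1, u2, v1, v2, vb):
    """Split u2'v2 - u1'v1 into (change in u at vb, change in v)."""
    part_u = dot(sub(u2, u1), vb)
    part_v = dot(u2, sub(v2, vb)) + dot(u1, sub(vb, v1))
    return part_u, part_v

# Hypothetical means and coefficients (constant, educ, exp)
x11, x21 = [1.0, 12.0, 20.0], [1.0, 11.0, 18.0]
x12, x22 = [1.0, 13.0, 21.0], [1.0, 12.5, 20.0]
b11, b21 = [1.00, 0.080, 0.020], [0.90, 0.070, 0.015]
b12, b22 = [1.10, 0.085, 0.022], [1.05, 0.080, 0.020]
# Hypothetical benchmark coefficients (e.g. from a pooled sample)
b1b, b2b = [1.05, 0.082, 0.021], [0.98, 0.076, 0.018]

dx1, dx2 = sub(x11, x21), sub(x12, x22)
db1, db2, dbb = sub(b11, b21), sub(b12, b22), sub(b1b, b2b)

dE = with_benchmark(dx1, dx2, b21, b22, b2b)
dC = with_benchmark(x21, x22, db1, db2, dbb)
dEC = with_benchmark(dx1, dx2, db1, db2, dbb)

dy1 = dot(x11, b11) - dot(x21, b21)
dy2 = dot(x12, b12) - dot(x22, b22)
total = sum(dE) + sum(dC) + sum(dEC)

# Benchmark = sample 1: the coefficient part reduces to dx_2'(b_22 - b_21)
dE_bm1 = with_benchmark(dx1, dx2, b21, b22, b21)

print("total:", round(total, 4), " dy_2 - dy_1:", round(dy2 - dy1, 4))
```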
Note that, if the benchmark estimates come from the pooled sample over both time points (or over all time points, if there are more than two), it seems reasonable to include time point dummies in the benchmark models. While this is unproblematic for the decomposition of dE, it may have unwanted effects on the decomposition of dC (because the year dummies will appear in the first term of the decomposition of dC). A better solution is to introduce the year dummies implicitly by fitting the benchmark models with the areg command (see help areg) and absorbing the time point variable.
Specifying reference models and a benchmark sample
If reference and benchmark models both are specified, the formulas may be written as:
dy_2 - dy_1 = dE + dC
dE = (dx_2-dx_1)'b_rb + [dx_2'(b_r2-b_rb) + dx_1'(b_rb-b_r1)]
dC = [(x_12-x_11)'(b_1b-b_rb) + (x_22-x_21)'(b_rb-b_2b)]
+ [x_12'((b_12-b_r2)-(b_1b-b_rb))
+ x_22'((b_r2-b_22)-(b_rb-b_2b))
+ x_11'((b_1b-b_rb)-(b_11-b_r1))
+ x_21'((b_rb-b_2b)-(b_r1-b_21))]
where b_rb is the reference coefficients vector from the benchmark sample. Using the second group estimates as the reference estimates and the first time point as the benchmark yields the parametrization applied by Smith and Welch (1989):
dy_2 - dy_1 = dE + dC
dE = (dx_2-dx_1)'b_21 + dx_2'(b_22-b_21)
          (1.i)             (1.iii)
dC = (x_12-x_11)'(b_11-b_21) + x_12'((b_12-b_22)-(b_11-b_21))
             (1.ii)                        (1.iv)
The numbers in parentheses beneath the decomposition components correspond to the equation numbers in Smith and Welch (1989:529). Furthermore, note that Smith and Welch use different indices (12 is 1, 22 is 2, 11 is 3, 21 is 4).
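The four-term parametrization can be verified with a few lines of Python. The numbers are made up, and the names t_i to t_iv are hypothetical labels for the terms (1.i) to (1.iv); the sketch only checks that the four terms add up to the change in the differential.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical means and coefficients (constant, educ, exp)
x11, x21 = [1.0, 12.0, 20.0], [1.0, 11.0, 18.0]
x12, x22 = [1.0, 13.0, 21.0], [1.0, 12.5, 20.0]
b11, b21 = [1.00, 0.080, 0.020], [0.90, 0.070, 0.015]
b12, b22 = [1.10, 0.085, 0.022], [1.05, 0.080, 0.020]

dx1, dx2 = sub(x11, x21), sub(x12, x22)
db1, db2 = sub(b11, b21), sub(b12, b22)

t_i = dot(sub(dx2, dx1), b21)     # (1.i)   change in endowment gap
t_iii = dot(dx2, sub(b22, b21))   # (1.iii) change in group 2 coefficients
t_ii = dot(sub(x12, x11), db1)    # (1.ii)  change in group 1 endowments
t_iv = dot(x12, sub(db2, db1))    # (1.iv)  change in coefficient gap

dE, dC = t_i + t_iii, t_ii + t_iv
dy1 = dot(x11, b11) - dot(x21, b21)
dy2 = dot(x12, b12) - dot(x22, b22)

print("dE:", round(dE, 4), " dC:", round(dC, 4))
print("dE + dC:", round(dE + dC, 4), " dy_2 - dy_1:", round(dy2 - dy1, 4))
```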
Technical notes:
- smithwelch does not require all models to contain the exact same set of regressors. Coefficients not appearing in a model are simply assumed to be zero for that model. However, it is important that all regressors are defined (i.e. non-missing) for all observations used with the decomposition. Thus, even if a regressor does not appear in an individual model, the regressor must contain valid values for the observations in the estimation sample of that model.
- If the models were estimated using weighted data (see help weight), smithwelch will take account of these weights in its computations of the means of the regressors.
- If multiple-equation models or models with ancillary parameters are used with smithwelch, only the first equation in e(b) is taken into account.
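The first two technical notes can be illustrated with a small Python sketch. It is not the smithwelch implementation: the helpers align and weighted_mean, the variable names, and all numbers are hypothetical. It shows a coefficient vector being padded with zeros for regressors a model does not contain, and regressor means being computed with weights.

```python
import math

def align(coefs, names):
    """Order coefficients by names; regressors absent from a model get 0."""
    return [coefs.get(n, 0.0) for n in names]

def weighted_mean(values, weights):
    """Weighted mean of a regressor over the estimation sample."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical models: tenure appears in the male model only
b_male = {"_cons": 1.0, "educ": 0.08, "exp": 0.02, "tenure": 0.01}
b_female = {"_cons": 0.9, "educ": 0.07, "exp": 0.015}

names = sorted(set(b_male) | set(b_female))
bm, bf = align(b_male, names), align(b_female, names)

# tenure must still be non-missing for the female sample, because its
# mean enters the decomposition even though its coefficient is zero there
tenure_female = [2.0, 4.0, 6.0]
weights_female = [1.0, 1.0, 2.0]
mean_tenure_female = weighted_mean(tenure_female, weights_female)

print(names)
print("female coefficients:", bf)
print("weighted mean of tenure (female):", mean_tenure_female)
```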
References
Heckman, James J., Thomas M. Lyons, Petra E. Todd (2000). Understanding Black-White Wage Differentials, 1960-1990. American Economic Review 90: 344-349.
Lee, Sang-Hyop (2000). On Decomposing Changes in Male-Female Wage Gap. Working Paper No. 00-12. University of Hawaii at Manoa.
Smith, James P., Finis R. Welch (1989). Black Economic Progress After Myrdal. Journal of Economic Literature 27: 519-564.
Author
Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch
Also see
Online: help for regress, estimates, jmpierce2 (if installed), oaxaca