Second-order Juhn-Murphy-Pierce decomposition
jmpierce2 est11 est21 est12 est22 [ , benchmark(1|2|est1bm est2bm) reference(1|2|estref1 estref2 [estrefbm]) detail[(dlist)] parametric residuals(newvar1 newvar2|prefix) ranks(newvar1 newvar2|prefix) nonotes nopreserve ]
where dlist is
name1 = varlist1 [ , name2 = varlist2 [, ... ] ]
Description
jmpierce2 computes the decomposition of differences in mean outcome differentials proposed by Juhn, Murphy and Pierce (1991). An example is the decomposition of the change of the black-white or the male-female wage differential over time (Juhn, Murphy and Pierce 1991; Blau and Kahn 1997) or the decomposition of differences in the male-female wage differential between countries (Blau and Kahn 1992, 1996; OECD 2002).
est11, est21, est12, and est22 specify the previously fitted and stored regression estimates to be used with the decomposition (see help estimates store). The model estimated last may be indicated by a period (.), even if it has not yet been stored. est11 and est21 specify the group 1 estimate (e.g. male, white) and the group 2 estimate (e.g. female, black) for the first sample (e.g. time point 1, country A), est12 and est22 are the group estimates for the second sample (time point 2, country B). Note that the estimation samples (e(sample)) of the specified models determine the relevant observations for the decomposition. Group 1 and group 2 must not overlap.
See the smithwelch package (available from the SSC archive; type ssc describe smithwelch) for an alternative approach to decompose differences in differentials.
Warning: jmpierce2 is intended for use with models that have been estimated by the regress command. Use jmpierce2 with other models at your own risk.
Options
benchmark(1|2|est1bm est2bm) specifies (the estimates for) the "benchmark" sample. benchmark(1) signifies that sample 1 is the benchmark sample and est11 and est21 are the benchmark estimates. Analogously, est12 and est22 are used as the benchmark, if you specify benchmark(2). Alternatively, use benchmark(est1bm est2bm) to provide the estimates from another sample to be used as the benchmark (e.g. the pooled sample over all time points or countries). If benchmark() is omitted, an extended decomposition containing interaction terms for simultaneous changes in quantities and prices is computed. See the Methods and Formulas Section below.
reference(1|2|estref1 estref2 [estrefbm]) determines the reference coefficients and reference residual distributions within the samples to be used with the decomposition. The default is reference(1), meaning that the coefficients from the first group (i.e. est11 and est12) are used; reference(2) uses the group 2 estimates (est21 and est22). Alternatively, specify reference(estref1 estref2 [estrefbm]) to provide other reference estimates (e.g. models based on the pooled samples over both groups). estrefbm is required only if benchmark(est1bm est2bm) is specified.
detail[(dlist)] requests that detailed decomposition results for the individual regressors be reported (applies only to the decomposition of the change in the "predicted gap"; see the Methods and Formulas Section below). Use dlist to subsume the results for specific groups of regressors (variables not appearing in dlist are listed individually). The usual shorthand conventions apply to the varlists specified in dlist (see help varlist). For example, specify detail(exp=exp*) if the models contain exp (experience) and exp2 (experience squared).
parametric causes jmpierce2 to compute the decomposition using standardized residuals and residual standard deviations. The default is to apply a nonparametric approach based on the relative ranks of the residuals and the inverse residual distribution functions.
residuals(newvar1 newvar2|prefix) saves the imputed hypothetical residuals as variables (newvar1 or prefix1 for the first sample, newvar2 or prefix2 for the second sample).
ranks(newvar1 newvar2|prefix) saves the computed percentile ranks as variables (newvar1 or prefix1 for the first sample, newvar2 or prefix2 for the second sample).
nonotes suppresses the display of the legend.
nopreserve is a technical option. jmpierce2 internally preserves the data (see help preserve) and then drops all unused observations to speed up the computations. However, if nopreserve is specified, jmpierce2 skips preserving the data and keeps the unused observations in memory. nopreserve may make sense if there are only few unused observations or if parametric is specified.
Examples
. regress lnwage educ exp exp2 if sex==0 & year==1 . estimates store male1 . regress lnwage educ exp exp2 if sex==1 & year==1 . estimates store female1 . regress lnwage educ exp exp2 if sex==0 & year==2 . estimates store male2 . regress lnwage educ exp exp2 if sex==1 & year==2 . estimates store female2 . jmpierce2 male1 female1 male2 female2
. jmpierce2 male1 female1 male2 female2, benchmark(1)
. generate byte year2 = year==2 . regress lnwage educ exp exp2 year2 if sex==0 & (year==1 | year==2) . estimates store male12 . regress lnwage educ exp exp2 year2 if sex==1 & (year==1 | year==2) . estimates store female12 . jmpierce2 male1 female1 male2 female2, benchmark(male12 female12)
. regress lnwage educ exp exp2 if year==1 . estimates store pooled1 . regress lnwage educ exp exp2 if year==2 . estimates store pooled2 . jmpierce2 male1 female1 male2 female2, reference(pooled1 pooled2)
Saved Results
Matrices:
r(D) Decomposition of differentials r(DD) Decomposition of difference in differentials r(E) Decomposition of difference in predicted gap r(U) Decomposition of difference in residual gap r(b1) Parameter vector for sample 1 r(b2) Parameter vector for sample 2 r(b3) Parameter vector for benchmark sample (if provided) r(dX1) Vector of quantity differences for sample 1 r(dX2) Vector of quantity differences for sample 2
Methods and Formulas
Consider the linear model
y_t = x_t'b_t + e_t, E(e_t) = 0
where y_t is a vector of outcomes (e.g. log hourly wages) at time t, x_t is the data matrix (the values of the regressors), b_t is a coefficients vector, and e_t is the vector of residuals. The model can be reformulated as
y_t = x_t'b_t + r_t*s_t
where s_t represents the standard deviation of the residuals and r_t is the vector of standardized residuals. Thus, the equation now has a two-component residual, that is, the residuals are expressed as a function of the general residual inequality at time t and the positions of the residuals in the residual distribution.
Given two groups (e.g. males and females), the mean outcome differential between the two groups can then be decomposed as follows:
dy_t = dx_t'b_t + dr_t*s_t
where dy is the difference in mean outcomes between the groups, dx is a vector of the group differences in means of regressors, and dr is the group difference in mean standardized residuals. The first term, E = dx_t'b_t, is the "predicted gap". It reflects the "explained" part of the differential due to differences in "observed quantities" (aka "endowments" aka regressors). The second term, U = dr_t*s_t, is the "residual gap" and reflects the "unexplained" part of the differential (due to differences in "unobserved quantities", their "unobserved prices", and discrimination). It is easy to see that the "predicted gap" and the "residual gap" are equivalent to the explained part and the unexplained part in the standard Blinder-Oaxaca decomposition (see, e.g., help oaxaca; available from the SSC Archive, type ssc describe oaxaca).
Now, given two time points t=1 and t=2 (or, e.g., two countries), the change in the outcome differential can be written as
dy_2-dy_1 = [dx_2'b_2 - dx_1'b_1] + [dr_2*s_2 - dr_1*s_1]
where the first part on the right-hand side of the equation is the change in the "predicted gap" (dE) and the second part is the change in the "residual gap" (dU). The two terms can be further decomposed into
dE = (dx_2-dx_1)'b_1 + dx_1'(b_2-b_1) + (dx_2-dx_1)'(b_2-b_1)
and
dU = (dr_2-dr_1)s_1 + dr_1(s_2-s_1) + (dr_2-dr_1)(s_2-s_1)
The first term in the decomposition of dE reflects the portion of the change in the "predicted gap" that is explained by changes in the group differences in "observed quantities" (aka endowments) and the second term is the part that is due to changes in "observed prices" (aka coefficients). The third term is an adjustment term accounting for the interaction effect induced by the simultaneous change in quantities and prices. Similarly, the first term in the decomposition of dU, sometimes called the "gap effect", reflects the change that is due to changes in the group differences in residual positions (i.e. changes in the group differences in "unobserved quantities" and changes in discrimination) and the second term is the part due to changes in residual inequality (i.e. changes in "unobserved prices" for the "unobserved quantities"). The last term again adjusts for interaction.
It is common practice to reduce the three terms in the decompositions above to two terms only by employing the coefficients vector and residual variation from a "benchmark" sample. Be b_B the benchmark coefficients vector and s_B the benchmark residual standard deviation. The decompositions may then be written as
dE = (dx_2-dx_1)'b_B + [dx_2'(b_2-b_B) + dx_1'(b_B-b_1)]
and
dU = (dr_2-dr_1)s_B + [dr_2(s_2-s_B) + dr_1(s_B-s_1)]
If one of the two time points is the benchmark, the formulas simplify to the parametrization applied by Juhn, Murphy and Pierce (1991), that is
dE = (dx_2-dx_1)'b_1 + dx_2'(b_2-b_1) dU = (dr_2-dr_1)s_1 + dr_2(s_2-s_1)
or the parametrization applied by, e.g., Blau and Kahn (1997), that is
dE = (dx_2-dx_1)'b_2 + dx_1'(b_2-b_1) dU = (dr_2-dr_1)s_2 + dr_1(s_2-s_1)
An alternative would be, for example, to use the pooled sample over all time points as the benchmark sample. Note that in this case it is reasonable to include year dummies in the models for the benchmark sample (see, e.g., OECD 2002:103).
Nonparametric implementation of the decomposition of dU
By definition, e_t = r_t*s_t. Therefore, dr_1*s_1 is simply the group difference in mean residuals at t=1 and dr_2*s_2 is the difference in mean residuals at t=2. But what about dr_1*s_2 or dr_2*s_1? One obvious solution would be to estimate the residual standard deviations and the standardized residuals for both time points and then multiply the standard deviation of one time point with the mean difference in standardized residuals of the other. This approach is applied by jmpierce2 if specifying the parametric option. The disadvantage of the parametric approach is that differences in distributional shape (apart from the variance of the distribution) are neglected. Therefore, Juhn et al. (1991) proposed the following non-parametric procedure, which is the default procedure in jmpierce2. Let F_t() be the distribution function of the residuals at time t. Furthermore, let q_t represent the positions of the residuals in the residual distribution at time t (see help relrank; available from the SSC Archive, type ssc describe relrank), that is
q_t = F_t(e_t)
Furthermore
e_t = F[-1]_t(q_t)
where F[-1]_t() stands for the inverse of F_t() (see help invcdf; available from the SSC Archive, type ssc describe invcdf). Applying the inverse distribution function of one time point to the residual ranks of the other, leads to a non-parametric version of the decomposition of dU. For example, dr_1*s_2 is obtained by assigning each individual at t=1 a percentile number corresponding to its position in the residual distribution of t=1, then using these relative ranks to derive hypothetical residuals for the t=1 individuals given the t=2 residual distribution function, and finally computing the group difference in the means of these hypothetical residuals.
Reference coefficients and reference residual distribution
For each time point, a reference model must be specified to determine the coefficients and residual distribution to be used in the decomposition. The default is to use est11 and est12 as the reference models (see the reference() option). From a technical point of view, two situations have to be distinguished. First, the reference model may be the group 1 model (reference(1)) or the group 2 model (reference(2)). In these cases, the coefficients of that model are used to compute the residuals for both groups, but only the observations in the reference group are used to determine the residual distribution function. Second, the reference model may be some other model (e.g. a pooled model over both groups). In this case, the coefficients from the reference model are again used to compute the residuals for both groups. The residual distribution function, however, is not derived from these residuals. It is instead computed using the pooled residuals from the two group-specific models.
Technical notes:
- jmpierce2 does not require all models to contain the exact same set of regressors. Coefficients not appearing in a model are simply assumed to be zero for that model. However, it is important that all regressors are defined (i.e. non-missing) for all observations used with the decomposition. Thus, even if a regressor does not appear in an individual model, the regressor must contain valid values for the observations in the estimation sample of that model.
- jmpierce2 computes residuals as the differences between the values of the model's dependent variable and the model's linear predictions (using matrix score). If the models have been estimated using weighted data, jmpierce2 will take account of these weights in its computations. In the parametric mode, jmpierce2 will use the value of e(rmse) as the model's residual standard deviation. If multiple-equation models or models with ancillary parameters are used with jmpierce2, only the first equation in e(b) is taken into account.
References
Juhn, Chinhui, Kevin M. Murphy, Brooks Pierce (1991). Accounting for the Slowdown in Black-White Wage Convergence. Pp. 107-143 in: Workers and Their Wages, ed. by Marvin Kosters, Washington, DC: AEI Press. Blau, Francine D., Lawrence M. Kahn (1992). The Gender Earnings Gap: Learning from International Comparisons. American Economic Review 82: 533-538. Blau, Francine D., Lawrence M. Kahn (1996). Wage Structure and Gender Earnings Differentials: an International Comparison. Economica 63: S29-S62. Blau, Francine D., Lawrence M. Kahn (1997). Swimming Upstream: Trends in the Gender Wage Differential in the 1980s. Journal of Labor Economics 15: 1-42. OECD (2002). Employment Outlook, Chapter 2. Paris.
Author
Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch
Also see
Online: help for regress, estimates, cumul, smithwelch (if installed), jmp (if installed), oaxaca (if installed), invcdf (if