{smcl}
{* *! version 3.0 3 May 2021}{...}
{hline}
help for {hi:ky_fit}{right:Stephen P. Jenkins and Fernando Rios-Avila (May 2021)}
{hline}
{vieweralsosee "postestimation" "help ky_estat"}{...}
{vieweralsosee "ky_sim" "help ky_sim"}{...}
{viewerjumpto "Syntax" "ky_fit##syntax"}{...}
{viewerjumpto "Description" "ky_fit##description"}{...}
{viewerjumpto "Options" "ky_fit##options"}{...}
{viewerjumpto "Remarks" "ky_fit##remarks"}{...}
{viewerjumpto "Examples" "ky_fit##examples"}{...}
{viewerjumpto "Authors" "ky_fit##authors"}{...}
{viewerjumpto "References" "ky_fit##references"}{...}
{title:Fitting mixture models of the Kapteyn-Ypma type to linked survey and administrative data}
{marker syntax}{...}
{title:Syntax}
{p 8 17 2}
{cmdab:ky_fit}
r_var s_var [cl_var]
[{help if}]
[{help in}]
[{help weights:pw fw aw iw}]
[{cmd:,}
{it:options}]
{marker options}{...}
{title:Options}
{synoptset 30 tabbed}{...}
{synopthdr}
{synoptline}
{dlgtab:Required}
{synopt:{opt r_var s_var [cl_var]}} Two variables are required for model
estimation. r_var refers to the measure of (log) earnings from the
administrative data. s_var refers to the measure of (log) earnings from the
survey data. cl_var is a binary variable that identifies the 'completely
labelled' group (observations for whom r_var and s_var are judged by the
analyst to be sufficiently close to each other and hence also latent 'true'
earnings to be counted as error-free) {p_end}
{synopt:} If cl_var is not declared, a variable named __ll__ is created
identifying observations for which abs(r_var {c -} s_var) <= delta {p_end}
{synopt:{opt delta(#)}} Declares the value taken by variable __ll__ when cl_var
is not declared. Default is 0.{p_end}
{synopt:{opt model(#)}} Selects the model that is fitted. Eight different models
are possible, corresponding to # = 1 through # = 8. The default value is # = 1
(Basic model). See the Description for further details. {p_end}
{dlgtab:Maximization options}
{synopt:{opt from(init_specs)}} initial values for the coefficients {p_end}
{synopt:{opt constraint(string)}} constraints by number to be applied {p_end}
{synopt:{opt technique(algorithm_spec)}} maximization technique {p_end}
{synopt:{opt search(srch opt)}} search options {p_end}
{synopt:{opt robust}} reports robust standard errors {p_end}
{synopt:{opt cluster(clvar)}} reports clustered standard errors {p_end}
{synopt:{opt trace}} display current parameter vector in iteration log {p_end}
{synopt:{opt diff:icult}} use a different stepping algorithm in nonconcave
regions {p_end}
{dlgtab:Display options}
{synopt:{opt base:levels}} specifies that base levels be reported for factor
variables and for interactions with bases that cannot be inferred from
their component factor variables. {p_end}
{synopt:{opt allbase:levels}} specifies that all base levels of factor
variables and interactions be reported. {p_end}
{dlgtab:Model specification options}
{phang} For each {cmd:model} fitted, every parameter can be modelled as a
function of covariates using a {it:varlist}. This allows for richer
specifications in which there is (additional) heterogeneity related to
observed characteristics. Factor variable notation can be used to specify
{it:varlist}. {p_end}
{synopt:{opt model(1)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} arho_s({it:varlist}) and lpi_s({it:varlist}) {p_end}
{synopt:{opt model(2)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_w({it:varlist}), ln_sig_w({it:varlist}), {p_end}
{synopt:} arho_s({it:varlist}) and lpi_s({it:varlist}) {p_end}
{synopt:{opt model(3)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_t({it:varlist}), ln_sig_t({it:varlist}), {p_end}
{synopt:} arho_s({it:varlist}), lpi_r({it:varlist}) and lpi_s({it:varlist}) {p_end}
{synopt:{opt model(4)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_t({it:varlist}), ln_sig_t({it:varlist}), {p_end}
{synopt:} mu_w({it:varlist}), ln_sig_w({it:varlist}), {p_end}
{synopt:} arho_s({it:varlist}), {p_end}
{synopt:} lpi_r({it:varlist}), lpi_s({it:varlist}) and lpi_w({it:varlist}) {p_end}
{synopt:{opt model(5)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_t({it:varlist}), ln_sig_t({it:varlist}), {p_end}
{synopt:} mu_w({it:varlist}), ln_sig_w({it:varlist}), {p_end}
{synopt:} mu_v({it:varlist}), ln_sig_v({it:varlist}), {p_end}
{synopt:} arho_r({it:varlist}), arho_s({it:varlist}), {p_end}
{synopt:} lpi_r({it:varlist}), lpi_s({it:varlist}), lpi_w({it:varlist}), {p_end}
{synopt:} and lpi_v({it:varlist}) {p_end}
{synopt:{opt model(6)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_t({it:varlist}), ln_sig_t({it:varlist}), {p_end}
{synopt:} mu_v({it:varlist}), ln_sig_v({it:varlist}), {p_end}
{synopt:} arho_r({it:varlist}), arho_s({it:varlist}), {p_end}
{synopt:} lpi_r({it:varlist}), lpi_s({it:varlist}), {p_end}
{synopt:} and lpi_v({it:varlist}) {p_end}
{synopt:{opt model(7)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_t({it:varlist}), ln_sig_t({it:varlist}), {p_end}
{synopt:} mu_w({it:varlist}), ln_sig_w({it:varlist}), {p_end}
{synopt:} arho_s({it:varlist}), arho_w({it:varlist}), {p_end}
{synopt:} lpi_r({it:varlist}), lpi_s({it:varlist}) and lpi_w({it:varlist}) {p_end}
{synopt:{opt model(8)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} mu_e({it:varlist}), ln_sig_e({it:varlist}), {p_end}
{synopt:} mu_n({it:varlist}), ln_sig_n({it:varlist}), {p_end}
{synopt:} mu_t({it:varlist}), ln_sig_t({it:varlist}), {p_end}
{synopt:} mu_w({it:varlist}), ln_sig_w({it:varlist}), {p_end}
{synopt:} mu_v({it:varlist}), ln_sig_v({it:varlist}), {p_end}
{synopt:} arho_r({it:varlist}), arho_s({it:varlist}), {p_end}
{synopt:} arho_w({it:varlist}), {p_end}
{synopt:} lpi_r({it:varlist}), lpi_s({it:varlist}), lpi_w({it:varlist}), {p_end}
{synopt:} and lpi_v({it:varlist}) {p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}
{cmd:aweight}s, {cmd:fweight}s, {cmd:iweight}s, and {cmd:pweight}s are
allowed; see {help weight}.{p_end}
{marker description}{...}
{title:Description}
{pstd} {cmd:ky_fit} fits 8 finite mixture models of earnings and measurement
errors of various types using linked survey and administrative data on earnings
(or similar variables). The first four models were proposed by Kapteyn and
Ypma (2007); the second four models are generalisations of the Kapteyn-Ypma
models proposed by Jenkins and Rios-Avila (2021c). We refer here to the general
class of models as "KY" models. Other innovations relative to Kapteyn and Ypma
(2007) are: (i) we allow model parameters to be functions of covariates, and
(ii) we incorporate a potential non-zero correlation betweeen the latent true
earnings and contamination error.
{pstd} The data comprise a set of observations ('workers', i = 1, ..., N),
for each of which there are 2 measures of (log) earnings: (a) from a
survey (denoted s_i) and (b) from an administrative record dataset (r_i). We
denote latent true (log) earnings by e_i. We refer to the mean of x as mu_x, and
the standard deviation (SD) of x as sig_x. The different combinations of
error-ridden and/or error-free observations characterize latent classes.
Latent class probabilities depend on the probabilities of the different
types of error.
{pstd} Each of the 8 models is a finite mixture of up to nine bivariate normal
distributions. In {cmd:ky_fit} we label the KY models 1{c -}8, where
model 1, the Basic Model, is the simplest, and model 8, the Extended Model, is the
most general. Next we set out out the structure of the Extended Model; the other
models are special cases of this. See Jenkins and Rios-Avila (2021b, c) for
further details.
{pstd} The distribution of administrative earnings contains observations
that are correctly linked to survey records with probability pi_r,
as well as observations for which the linkage is incorrect ('mismatch') with
probability (1{c -}pi_r). Even if observations are correctly linked, some
values of r_i may be subject to regression-to-the-mean (RTM) measurement error
with probability (1{c -}pi_v). For mismatched individuals, observed
administrative earnings are a draw from the distribution of earnings in the
administrative data. In sum, there are three types of r_i observation:
{p 8 12 2}R1: r_i = e_i, with probability pi_r * pi_v {p_end}
{p 8 12 2}R2: r_i = e_i + rho_r*(e_i{c -}mu_e) + v_i , with probability pi_r * (1 {c -} pi_v) {p_end}
{p 8 12 2}R3: r_i = t_i, with probability 1 {c -} pi_r {p_end}
{pstd} The distribution of survey earnings contains three types of observation.
First, there are observations with earnings that are reported correctly
(i.e. without error), with probability pi_s. Second, there are observations with
earnings reported with measurement error with the error including a RTM
component, with probability (1{c -}pi_s)*(1{c -}pi_w). Third, there are
observations that contain additional 'contamination' error in addition to
RTM measurement error, with probability (1{c -}pi_s)*pi_w. In sum, there are
three types of s_i observation:
{p 8 12 2}S1: s_i = e_i, with probability pi_s {p_end}
{p 8 12 2}S2: s_i = e_i + rho_s*(e_i{c -}mu_e) + n_i
, with probability (1-pi_s)*(1-pi_w) {p_end}
{p 8 12 2}S3: s_i = e_i + rho_s*(e_i{c -}mu_e) + n_i + w_i
, with probability (1{c -}pi_s)*pi_w {p_end}
{pstd} For model fitting, we follow KY and assume that errors are independently
and identically distributed normal, with the exception of e_i and w_i,
for which we assume a bivariate normal distribution with correlation rho_w:
{p 8 12 2}(e_i, w_i) ~ BN([mu_e,mu_w], [(sig_e)^2,(sig_w)^2],rho_w*sig_e*sig_w) {p_end}
{p 8 12 2}n_i ~ N(mu_n, (sig_n)^2) {p_end}
{p 8 12 2}v_i ~ N(mu_v, (sig_v)^2) {p_end}
{p 8 12 2}t_i ~ N(mu_t, (sig_t)^2) {p_end}
{pstd} The Extended Model, {cmd:ky_fit}'s model 8, has nine latent classes. It
is a mixture of nine bivariate distributions representing combinations of the
three types of administrative data observation and the three types of survey
data observation. Class 1 contains 'completely labeled' observations, i.e. those
for which survey and administrative data earnings measures are error-free and
hence also equal to latent true earnings.
{p 8 12 2}Class 1: r_i ~ R1 and s_i ~ S1, with probability pi_r*pi_v*pi_s {p_end}
{p 8 12 2}Class 2: r_i ~ R1 and s_i ~ S2, with probability pi_r*pi_v*(1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}Class 3: r_i ~ R1 and s_i ~ S3, with probability pi_r*pi_v*(1{c -}pi_s)*pi_w {p_end}
{p 8 12 2}Class 4: r_i ~ R2 and s_i ~ S1, with probability pi_r*(1{c -}pi_v)*pi_s {p_end}
{p 8 12 2}Class 5: r_i ~ R2 and s_i ~ S2, with probability pi_r*(1{c -}pi_v)*(1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}Class 6: r_i ~ R2 and s_i ~ S3, with probability pi_r*(1{c -}pi_v)*(1{c -}pi_s)*pi_w {p_end}
{p 8 12 2}Class 7: r_i ~ R3 and s_i ~ S1, with probability (1{c -}pi_r)*pi_s {p_end}
{p 8 12 2}Class 8: r_i ~ R3 and s_i ~ S2, with probability (1{c -}pi_r)*(1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}Class 9: r_i ~ R3 and s_i ~ S3, with probability (1{c -}pi_r)*(1{c -}pi_s)*pi_w {p_end}
{pstd} Models 1 and 2 assume that the administrative data contain no error (i.e.
no mismatch and no measurement error). {p_end}
{pstd} Models 3, 4, and 7 assume that the administrative data contain only
mismatch error. {p_end}
{pstd} Models 5, 6, and 8 assume that the administrative data contain mismatch
error and RTM error. {p_end}
{pstd} Models 1, 3, and 6 assume that the survey data contain only RTM
measurement error. {p_end}
{pstd} Models 2, 4, 5, 7 and 8 assume that the survey data contain RTM
measurement error plus contamination error. {p_end}
{pstd} Models 7 and 8 assume the survey contamination error is correlated with
the latent true earnings. {p_end}
{pstd} When fitting the models by maximum likelihood, we transform parameters
other than means to ensure their estimates lie within their theoretical ranges.
That is, we ensure that all standard deviations are strictly positive; the RTM
parameters and correlation between w_i and e_i lie between {c -}1 and 1; and
error probabilities pi_s, pi_r, pi_w, and pi_v each lie between 0 and 1. {p_end}
{pstd}To report parameters in their natural metric, invert the transformations: {p_end}
{p 8 12 2}sig_x = exp(ln_sig_x) for x = e, n, w, t, v {p_end}
{p 8 12 2}rho_x = tanh(arho_x) for x = r, s, w {p_end}
{p 8 12 2}pi_x = logistic(lpi_x) for x = s, w, r, v {p_end}
{pstd} We provide post-estimation utilities {cmd:ky_estat} and {cmd:ky_p} to
enable users to derive parameters (and SEs) in their natural metric. See
Jenkins and Rios-Avila (2021b, 2021c) for details.
{pstd} Users should experiment with multiple sets of initial values to check
that models converge to a global maximum rather than some local maximum. (The
risk of convergence to local maxima is greater for models with covariates.)
{cmd:ky_fit} fits models in a sequential fashion, beginning with simpler models
that provide starting values for more complex models. This reduces the risk of
convergence to local maxima but does not remove it altogether.
{pstd}See Kapteyn and Ypma (2007) for details of models 1{c -}4, and estimates
for a sample of Swedish workers aged 50+ years. For the same models, Jenkins
and Rios-Avila (2020) provide estimates for a sample of UK workers from
across the full range, also analyzing estimate sensitivity to the choice of the
'completely labelled' fraction (the size of class 1). Meijer, Rohwedder, and
Wansbeek (2012) derive hybrid earnings predictors of latent true earnings
combining information from administrative and survey data and model estimates,
and measures of reliability. They illustrate their methods using Kapteyn and
Ypma's (2007) Full model (what we label model 4). Jenkins and Rios-Avila (2021a)
replicate Meijer, Rohwedder, and Wansbeek's analysis, and apply their methods
to estimates of model 4 derived from UK data. {p_end}
{pstd} See Jenkins and Rios-Avila (2021c) for discussion of models 5{c -}8, and
estimates for models with and without parameters expressed as functions of
covariates. They use UK data. Jenkins and Rios-Avila (2021b) discuss in greater
detail model fitting using {cmd:ky_fit} and post-estimation methods,
including the methods of Meijer, Rohwedder, and Wansbeek (2012) for models
1{c -}8. {p_end}
{marker examples}{...}
{title:Examples}
{pstd}
{pstd} For detailed example, please see do-file "{stata doedit ky_example.do:ky_example.do}".
{p_end}
{marker results}{...}
{title:Stored results}
{pstd}
In addition to the results stored from {cmd:ml} (see {help maximize}),
{cmd:ky_fit} stores the following in {cmd:e()}:{p_end}
{synoptset 20 tabbed}{...}
{p2col 5 20 24 2: Scalars}{p_end}
{synopt:{cmd:e(ic)}}Number of iterations of last model.
Does not include iterations of intermediate models, if any.{p_end}
{synopt:{cmd:e(method_c)}}Code for estimated model. See model specification.{p_end}
{synoptset 20 tabbed}{...}
{p2col 5 20 24 2: Macros}{p_end}
{synopt:{cmd:e(predict)}}{cmd:ky_p}: program used to implement {cmd:predict}.{p_end}
{synopt:{cmd:e(estat_cmd)}}{cmd:ky_estat}: program used to implement post-estimation statistics {cmd:estat}.{p_end}
{synopt:{cmd:e(method)}}Description of model specification.{p_end}
{synopt:{cmd:e(cmd)}}{cmd:ky_fit}{p_end}
{synopt:{cmd:e(cmdline)}}command as typed{p_end}
{synopt:{cmd:e(depvar)}}names of dependent variables, i.e. the administrative
and survey log earnings variables, and completely-labeled group identifier{p_end}
{synopt:{cmd:e(user)}}name of likelihood-evaluator program. ky_ll_# with # =
1, 2, ..., 8.{p_end}
{marker references}{...}
{title:References}
{pstd}Jenkins, S. P. and Rios-Avila, F. (2020).
Measurement errors in survey and administrative data on employment earnings:
sensitivity to the fraction assumed to have error-free earnings’,
{it: Economics Letters}, 192: 109253. {browse "https://doi.org/10.1016/j.econlet.2020.109253"}
{pstd}Jenkins, S. P. and Rios-Avila, F. (2021a).
Measurement error in earnings data: replication of
Meijer, Rohwedder, and Wansbeek’s mixture model
approach to combining survey and register data,
{it: Journal of Applied Econometrics}, online first.
{browse "https://doi.org/10.1002/jae.2811"}
{pstd}Jenkins, S. P. and Rios-Avila, F. (2021b).
Finite mixture models for linked survey and administrative data: estimation and
post-estimation. IZA Discussion Paper, forthcoming.
{browse "https://www.iza.org/publications/dp"}
For submission to {it:The Stata Journal}.
{pstd}Jenkins, S. P. and Rios-Avila, F. (2021c).
Reconciling reports: modelling employment earnings and measurement errors
using linked survey and administrative data.
IZA Discussion Paper, forthcoming. {browse "https://www.iza.org/publications/dp"}
{pstd}Kapteyn, A. and Ypma, Y. A. (2007). Measurement error and misclassification: a
comparison of survey and administrative data.
{it: Journal of Labor Economics} 25 (3): 513{c -}551.
{browse "https://www.journals.uchicago.edu/doi/abs/10.1086/513298"}
{pstd}Meijer, E., Rohwedder, S. and Wansbeek T. (2012). Measurement error in
earnings data: using a mixture model approach to combine survey and register data.
{it:Journal of Business & Economic Statistics} 30 (2): 191{c -}201.
{browse "https://www.tandfonline.com/doi/abs/10.1198/jbes.2011.08166"}
{marker authors}{...}
{title:Authors}
{pstd}
Stephen P. Jenkins {break}
Department of Social Policy{break}
London School of Economics and Political Science {break}
Houghton Street, London WC2A 2AE, UK{break}
Email: s.jenkins@lse.ac.uk
{pstd}
Fernando Rios-Avila{break}
Levy Economics Institute of Bard College{break}
Annandale-on-Hudson, NY 12504-5000, USA{break}
Email: friosavi@levy.org
{marker alsosee}{...}
{title:Also see}
{p 4 13 2}
{help ky_estat} if installed; {help ky_sim} if installed.