{smcl}
{* *! version 3.0 3 May 2021}{...}

{hline}
help for {hi:ky_fit}{right:Stephen P. Jenkins and Fernando Rios-Avila (May 2021)}
{hline}

{vieweralsosee "postestimation" "help ky_estat"}{...}
{vieweralsosee "ky_sim" "help ky_sim"}{...}
{viewerjumpto "Syntax" "ky_fit##syntax"}{...}
{viewerjumpto "Options" "ky_fit##options"}{...}
{viewerjumpto "Description" "ky_fit##description"}{...}
{viewerjumpto "Examples" "ky_fit##examples"}{...}
{viewerjumpto "Stored results" "ky_fit##results"}{...}
{viewerjumpto "References" "ky_fit##references"}{...}
{viewerjumpto "Authors" "ky_fit##authors"}{...}

{title:Fitting mixture models of the Kapteyn-Ypma type to linked survey and administrative data}

{marker syntax}{...}
{title:Syntax}
{p 8 17 2}
{cmdab:ky_fit}
r_var s_var [cl_var]
[{help if}]
[{help in}]
[{help weights:pw fw aw iw}]
[{cmd:,}
{it:options}]


{marker options}{...}
{title:Options}

{synoptset 30 tabbed}{...}
{synopthdr}
{synoptline}

{dlgtab:Required}

{synopt:{opt r_var s_var [cl_var]}} Two variables are required for model 
estimation. r_var is the measure of (log) earnings from the 
administrative data. s_var is the measure of (log) earnings from the 
survey data. The optional cl_var is a binary variable that identifies the 
'completely labelled' group: observations for which r_var and s_var are judged 
by the analyst to be sufficiently close to each other, and hence to latent 
'true' earnings, to be counted as error-free.{p_end}

{synopt:} If cl_var is not declared, a variable named __ll__ is created
identifying observations for which abs(r_var {c -} s_var) <= delta {p_end}

{synopt:{opt delta(#)}} Sets the threshold delta used to construct __ll__ when 
cl_var is not declared: observations with abs(r_var {c -} s_var) <= delta are 
flagged as completely labelled. Default is 0.{p_end}

{synopt:{opt model(#)}} Selects the model that is fitted. Eight different models 
are possible, corresponding to # = 1 through # = 8. The default value is # = 1 
(Basic model). See the Description for further details. {p_end}

{dlgtab:Maximization options}

{synopt:{opt from(init_specs)}} initial values for the coefficients  {p_end}
{synopt:{opt constraint(string)}} constraints by number to be applied {p_end}
{synopt:{opt technique(algorithm_spec)}}  maximization technique  {p_end}
{synopt:{opt search(srch opt)}} search options {p_end}
{synopt:{opt robust}} reports robust standard errors {p_end}
{synopt:{opt cluster(clvar)}} reports clustered standard errors {p_end}
{synopt:{opt trace}} display current parameter vector in iteration log {p_end}
{synopt:{opt diff:icult}} use a different stepping algorithm in nonconcave 
regions {p_end}

{dlgtab:Display options}

{synopt:{opt base:levels}} specifies that base levels be reported for factor 
variables and for interactions with bases that cannot be inferred from 
their component factor variables.  {p_end}
{synopt:{opt allbase:levels}}  specifies that all base levels of factor 
variables and interactions be reported. {p_end}

{dlgtab:Model specification options}

{phang} For each {cmd:model} fitted, every parameter can be modelled as a 
function of covariates using a {it:varlist}. This allows for richer 
specifications in which there is (additional) heterogeneity related to 
observed characteristics. Factor variable notation can be used to specify 
{it:varlist}. {p_end}
 
{synopt:{opt model(1)}} The following parameters can be made functions of covariates: {p_end}
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:}		arho_s({it:varlist}) and lpi_s({it:varlist}) {p_end}

{synopt:{opt model(2)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_w({it:varlist}), ln_sig_w({it:varlist}),  {p_end}
{synopt:}		arho_s({it:varlist}) and lpi_s({it:varlist}) {p_end} 

{synopt:{opt model(3)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_t({it:varlist}), ln_sig_t({it:varlist}),  {p_end}
{synopt:}		arho_s({it:varlist}), lpi_r({it:varlist}) and lpi_s({it:varlist}) {p_end}

{synopt:{opt model(4)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_t({it:varlist}), ln_sig_t({it:varlist}),  {p_end}
{synopt:} 		mu_w({it:varlist}), ln_sig_w({it:varlist}),  {p_end}
{synopt:}		arho_s({it:varlist}), {p_end}
{synopt:}		lpi_r({it:varlist}), lpi_s({it:varlist}) and lpi_w({it:varlist}) {p_end}

{synopt:{opt model(5)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_t({it:varlist}), ln_sig_t({it:varlist}),  {p_end}
{synopt:} 		mu_w({it:varlist}), ln_sig_w({it:varlist}),  {p_end}
{synopt:} 		mu_v({it:varlist}), ln_sig_v({it:varlist}),  {p_end}
{synopt:}		arho_r({it:varlist}), arho_s({it:varlist}),  {p_end}
{synopt:}		lpi_r({it:varlist}), lpi_s({it:varlist}), lpi_w({it:varlist}), {p_end}
{synopt:}		and lpi_v({it:varlist}) {p_end} 

{synopt:{opt model(6)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_t({it:varlist}), ln_sig_t({it:varlist}),  {p_end}
{synopt:} 		mu_v({it:varlist}), ln_sig_v({it:varlist}),  {p_end}
{synopt:}		arho_r({it:varlist}), arho_s({it:varlist}),  {p_end}
{synopt:}		lpi_r({it:varlist}), lpi_s({it:varlist}), {p_end}
{synopt:}		and lpi_v({it:varlist}) {p_end} 

{synopt:{opt model(7)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_t({it:varlist}), ln_sig_t({it:varlist}),  {p_end}
{synopt:} 		mu_w({it:varlist}), ln_sig_w({it:varlist}),  {p_end}
{synopt:}		arho_s({it:varlist}), arho_w({it:varlist}), {p_end}
{synopt:}		lpi_r({it:varlist}), lpi_s({it:varlist}) and lpi_w({it:varlist}) {p_end}

{synopt:{opt model(8)}} The following parameters can be made functions of covariates: {p_end} 
{synopt:} 		mu_e({it:varlist}), ln_sig_e({it:varlist}),  {p_end}
{synopt:} 		mu_n({it:varlist}), ln_sig_n({it:varlist}),  {p_end}
{synopt:} 		mu_t({it:varlist}), ln_sig_t({it:varlist}),  {p_end}
{synopt:} 		mu_w({it:varlist}), ln_sig_w({it:varlist}),  {p_end}
{synopt:} 		mu_v({it:varlist}), ln_sig_v({it:varlist}),  {p_end}
{synopt:}		arho_r({it:varlist}), arho_s({it:varlist}),  {p_end}
{synopt:}		arho_w({it:varlist}), {p_end}
{synopt:}		lpi_r({it:varlist}), lpi_s({it:varlist}), lpi_w({it:varlist}), {p_end}
{synopt:}		and lpi_v({it:varlist}) {p_end} 

{synoptline}
{p2colreset}{...}

{p 4 6 2}
{cmd:aweight}s, {cmd:fweight}s, {cmd:iweight}s, and {cmd:pweight}s are
allowed; see {help weight}.{p_end}


{marker description}{...}
{title:Description}

{pstd} {cmd:ky_fit} fits 8 finite mixture models of earnings and measurement 
errors of various types using linked survey and administrative data on earnings 
(or similar variables). The first four models were proposed by Kapteyn and 
Ypma (2007); the second four models are generalisations of the Kapteyn-Ypma 
models proposed by Jenkins and Rios-Avila (2021c). We refer here to the general 
class of models as "KY" models. Other innovations relative to Kapteyn and Ypma 
(2007) are: (i) we allow model parameters to be functions of covariates, and 
(ii) we incorporate a potential non-zero correlation between the latent true 
earnings and contamination error.

{pstd} The data comprise a set of observations ('workers', i = 1, ..., N), 
for each of which there are 2 measures of (log) earnings: (a) from a 
survey (denoted s_i) and (b) from an administrative record dataset (r_i). We 
denote latent true (log) earnings by e_i. We refer to the mean of x as mu_x, and
the standard deviation (SD) of x as sig_x. The different combinations of 
error-ridden and/or error-free observations characterize latent classes. 
Latent class probabilities depend on the probabilities of the different 
types of error. 

{pstd} Each of the 8 models is a finite mixture of up to nine bivariate normal 
distributions. In {cmd:ky_fit} we label the KY models 1{c -}8, where
model 1, the Basic Model, is the simplest, and model 8, the Extended Model, is the 
most general. Next we set out the structure of the Extended Model; the other
models are special cases of this. See Jenkins and Rios-Avila (2021b, c) for 
further details.

{pstd} The distribution of administrative earnings contains observations
that are correctly linked to survey records with probability pi_r,
as well as observations for which the linkage is incorrect ('mismatch') with
probability (1{c -}pi_r). Even if observations are correctly linked, some 
values of r_i may be subject to regression-to-the-mean (RTM) measurement error 
with probability (1{c -}pi_v). For mismatched individuals, observed 
administrative earnings are a draw from the distribution of earnings in the 
administrative data. In sum, there are three types of r_i observation:

{p 8 12 2}R1: r_i = e_i, with probability   pi_r * pi_v {p_end}
{p 8 12 2}R2: r_i = e_i + rho_r*(e_i{c -}mu_e) + v_i , with probability  pi_r * (1 {c -} pi_v) {p_end}
{p 8 12 2}R3: r_i = t_i, with probability 1 {c -} pi_r {p_end}

{pstd} The distribution of survey earnings contains three types of observation. 
First, there are observations with earnings that are reported correctly 
(i.e. without error), with probability pi_s. Second, there are observations 
whose earnings are reported with measurement error that includes an RTM 
component, with probability (1{c -}pi_s)*(1{c -}pi_w). Third, there are 
observations that contain 'contamination' error in addition to RTM 
measurement error, with probability (1{c -}pi_s)*pi_w. In sum, there are
three types of s_i observation:

{p 8 12 2}S1: s_i = e_i, with probability pi_s {p_end}
{p 8 12 2}S2: s_i = e_i + rho_s*(e_i{c -}mu_e) + n_i, with probability (1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}S3: s_i = e_i + rho_s*(e_i{c -}mu_e) + n_i + w_i, with probability (1{c -}pi_s)*pi_w {p_end}

{pstd} For model fitting, we follow KY and assume that errors are independently 
and identically distributed normal, with the exception of e_i and w_i,
for which we assume a bivariate normal distribution with correlation rho_w:

{p 8 12 2}(e_i, w_i) ~ BN([mu_e,mu_w], [(sig_e)^2,(sig_w)^2],rho_w*sig_e*sig_w) {p_end}
{p 8 12 2}n_i ~ N(mu_n, (sig_n)^2) {p_end}
{p 8 12 2}v_i ~ N(mu_v, (sig_v)^2) {p_end}
{p 8 12 2}t_i ~ N(mu_t, (sig_t)^2) {p_end}

{pstd} The Extended Model, {cmd:ky_fit}'s model 8, has nine latent classes. It
is a mixture of nine bivariate distributions representing combinations of the 
three types of administrative data observation and the three types of survey 
data observation. Class 1 contains 'completely labelled' observations, i.e. those
for which survey and administrative data earnings measures are error-free and
hence also equal to latent true earnings.

{p 8 12 2}Class 1: r_i ~ R1 and s_i ~ S1, with probability pi_r*pi_v*pi_s {p_end}
{p 8 12 2}Class 2: r_i ~ R1 and s_i ~ S2, with probability pi_r*pi_v*(1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}Class 3: r_i ~ R1 and s_i ~ S3, with probability pi_r*pi_v*(1{c -}pi_s)*pi_w {p_end}
{p 8 12 2}Class 4: r_i ~ R2 and s_i ~ S1, with probability pi_r*(1{c -}pi_v)*pi_s {p_end}
{p 8 12 2}Class 5: r_i ~ R2 and s_i ~ S2, with probability pi_r*(1{c -}pi_v)*(1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}Class 6: r_i ~ R2 and s_i ~ S3, with probability pi_r*(1{c -}pi_v)*(1{c -}pi_s)*pi_w {p_end}
{p 8 12 2}Class 7: r_i ~ R3 and s_i ~ S1, with probability (1{c -}pi_r)*pi_s {p_end}
{p 8 12 2}Class 8: r_i ~ R3 and s_i ~ S2, with probability (1{c -}pi_r)*(1{c -}pi_s)*(1{c -}pi_w) {p_end}
{p 8 12 2}Class 9: r_i ~ R3 and s_i ~ S3, with probability (1{c -}pi_r)*(1{c -}pi_s)*pi_w {p_end}
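
{pstd} As a worked check on the class probabilities listed above, the following 
sketch evaluates three of them for purely illustrative values (pi_r = 0.9, 
pi_v = 0.8, pi_s = 0.5, pi_w = 0.2). These numbers are assumptions chosen to make 
the arithmetic easy to follow; they are not estimates from any dataset. {p_end}

{p 8 12 2}{cmd:. scalar pi_r = 0.9}{p_end}
{p 8 12 2}{cmd:. scalar pi_v = 0.8}{p_end}
{p 8 12 2}{cmd:. scalar pi_s = 0.5}{p_end}
{p 8 12 2}{cmd:. scalar pi_w = 0.2}{p_end}
{p 8 12 2}{cmd:. display pi_r*pi_v*pi_s}{p_end}
{p 8 12 2}{cmd:. display pi_r*(1-pi_v)*(1-pi_s)*(1-pi_w)}{p_end}
{p 8 12 2}{cmd:. display (1-pi_r)*(1-pi_s)*pi_w}{p_end}

{pstd} With these values, class 1 has probability 0.9*0.8*0.5 = 0.36, class 5 
has probability 0.9*0.2*0.5*0.8 = 0.072, and class 9 has probability 
0.1*0.5*0.2 = 0.01. The nine class probabilities sum to one because the 
probabilities of R1, R2, and R3 sum to one, as do those of S1, S2, and S3. {p_end}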

{pstd} Models 1 and 2 assume that the administrative data contain no error (i.e.
no mismatch and no measurement error). {p_end}

{pstd} Models 3, 4, and 7 assume that the administrative data contain only 
mismatch error. {p_end}

{pstd} Models 5, 6, and 8 assume that the administrative data contain mismatch 
error and RTM error. {p_end}

{pstd} Models 1, 3, and 6 assume that the survey data contain only RTM 
measurement error. {p_end}

{pstd} Models 2, 4, 5, 7 and 8 assume that the survey data contain RTM 
measurement error plus contamination error. {p_end}

{pstd} Models 7 and 8 assume the survey contamination error is correlated with 
the latent true earnings. {p_end}
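
{pstd} To make the structure of the simplest case concrete, the following 
sketch simulates data with the structure of model 1: the administrative measure 
equals latent true earnings, and the survey measure is either error-free (type 
S1) or subject to RTM measurement error (type S2). All variable names and 
parameter values (mu_e = 10, sig_e = 0.5, mu_n = 0, sig_n = 0.2, rho_s = 
{c -}0.1, pi_s = 0.4) are hypothetical and chosen for illustration only. {p_end}

{p 8 12 2}{cmd:. clear}{p_end}
{p 8 12 2}{cmd:. set seed 12345}{p_end}
{p 8 12 2}{cmd:. set obs 1000}{p_end}
{p 8 12 2}{cmd:. generate double e_true = rnormal(10, 0.5)}{p_end}
{p 8 12 2}{cmd:. generate double n_err = rnormal(0, 0.2)}{p_end}
{p 8 12 2}{cmd:. generate byte s_correct = runiform() < 0.4}{p_end}
{p 8 12 2}{cmd:. generate double r_admin = e_true}{p_end}
{p 8 12 2}{cmd:. generate double s_survey = cond(s_correct, e_true, e_true + (-0.1)*(e_true - 10) + n_err)}{p_end}
{p 8 12 2}{cmd:. count if r_admin == s_survey}{p_end}

{pstd} The final command counts the observations for which the two measures 
coincide exactly, which is the group that the default {opt delta(0)} would 
treat as completely labelled. See also {help ky_sim} if installed. {p_end}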

{pstd} When fitting the models by maximum likelihood, we transform parameters 
other than means to ensure their estimates lie within their theoretical ranges.
That is, we ensure that all standard deviations are strictly positive; the RTM 
parameters and correlation between w_i and e_i lie between {c -}1 and 1; and 
error probabilities pi_s, pi_r, pi_w, and pi_v each lie between 0 and 1. {p_end}

{pstd}To report parameters in their natural metric, invert the transformations: {p_end}
{p 8 12 2}sig_x = exp(ln_sig_x) for x = e, n, w, t, v {p_end}
{p 8 12 2}rho_x = tanh(arho_x) for x = r, s, w {p_end}
{p 8 12 2}pi_x = logistic(lpi_x) for x = s, w, r, v {p_end}
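
{pstd} As a quick numerical illustration of these inverse transformations, the 
commands below use Stata's built-in {cmd:exp()}, {cmd:tanh()}, and 
{cmd:invlogit()} functions. The input values ({c -}0.5, 0.25, and 1.5) are 
hypothetical transformed estimates, not output from any fitted model. {p_end}

{p 8 12 2}{cmd:. display exp(-0.5)}{p_end}
{p 8 12 2}{cmd:. display tanh(0.25)}{p_end}
{p 8 12 2}{cmd:. display invlogit(1.5)}{p_end}

{pstd} These correspond to a standard deviation of about 0.607, an RTM or 
correlation parameter of about 0.245, and a probability of about 0.818. 
Point estimates can be converted this way, but standard errors cannot. {p_end}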

{pstd} We provide post-estimation utilities {cmd:ky_estat} and {cmd:ky_p} to 
enable users to derive parameters (and SEs) in their natural metric. See
Jenkins and Rios-Avila (2021b, 2021c) for details.

{pstd} Users should experiment with multiple sets of initial values to check 
that models converge to a global maximum rather than some local maximum. (The 
risk of convergence to local maxima is greater for models with covariates.)
{cmd:ky_fit} fits models in a sequential fashion, beginning with simpler models
that provide starting values for more complex models. This reduces the risk of 
convergence to local maxima but does not remove it altogether.

{pstd}See Kapteyn and Ypma (2007) for details of models 1{c -}4, and estimates 
for a sample of Swedish workers aged 50+ years. For the same models, Jenkins 
and Rios-Avila (2020) provide estimates for a sample of UK workers from 
across the full age range, also analyzing estimate sensitivity to the choice of the 
'completely labelled' fraction (the size of class 1). Meijer, Rohwedder, and 
Wansbeek (2012) derive hybrid predictors of latent true earnings that combine 
information from administrative data, survey data, and model estimates, 
together with measures of reliability. They illustrate their methods using Kapteyn and
Ypma's (2007) Full model (what we label model 4). Jenkins and Rios-Avila (2021a)
replicate Meijer, Rohwedder, and Wansbeek's analysis, and apply their methods
to estimates of model 4 derived from UK data. {p_end}

{pstd} See Jenkins and Rios-Avila (2021c) for discussion of models 5{c -}8, and 
estimates for models with and without parameters expressed as functions of 
covariates. They use UK data. Jenkins and Rios-Avila (2021b) discuss in greater 
detail model fitting using {cmd:ky_fit} and post-estimation methods, 
including the methods of Meijer, Rohwedder, and Wansbeek (2012) for models 
1{c -}8. {p_end}

{marker examples}{...}
{title:Examples}
{pstd} For a detailed example, see the do-file "{stata doedit ky_example.do:ky_example.do}".{p_end}
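
{pstd} The commands below are a minimal sketch of typical calls, following the 
syntax documented above. The variable names {cmd:ln_admin}, {cmd:ln_survey}, 
{cmd:female}, and {cmd:cl}, and the threshold 0.005, are hypothetical and should 
be replaced with the names and values appropriate to your data. {p_end}

{pstd} Fit the Basic model, treating exact agreement as completely labelled: {p_end}
{p 8 12 2}{cmd:. ky_fit ln_admin ln_survey, model(1)}{p_end}

{pstd} Fit the same model, letting {cmd:ky_fit} construct the completely 
labelled group with a non-zero threshold: {p_end}
{p 8 12 2}{cmd:. ky_fit ln_admin ln_survey, model(1) delta(0.005)}{p_end}

{pstd} Construct the completely labelled indicator by hand, then fit model 4 
with two parameters depending on a covariate: {p_end}
{p 8 12 2}{cmd:. generate byte cl = abs(ln_admin - ln_survey) <= 0.005 if !missing(ln_admin, ln_survey)}{p_end}
{p 8 12 2}{cmd:. ky_fit ln_admin ln_survey cl, model(4) mu_e(i.female) lpi_s(i.female)}{p_end}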

{marker results}{...}
{title:Stored results}

{pstd}
In addition to the results stored from {cmd:ml} (see {help maximize}), 
{cmd:ky_fit} stores the following in {cmd:e()}:{p_end}

{synoptset 20 tabbed}{...}
{p2col 5 20 24 2: Scalars}{p_end}
{synopt:{cmd:e(ic)}}Number of iterations of last model.
 Does not include iterations of intermediate models, if any.{p_end}
{synopt:{cmd:e(method_c)}}Code for estimated model. See model specification.{p_end}


{synoptset 20 tabbed}{...}
{p2col 5 20 24 2: Macros}{p_end}
{synopt:{cmd:e(predict)}}{cmd:ky_p}: program used to implement {cmd:predict}.{p_end}
{synopt:{cmd:e(estat_cmd)}}{cmd:ky_estat}: program used to implement post-estimation statistics {cmd:estat}.{p_end}
{synopt:{cmd:e(method)}}Description of model specification.{p_end}
{synopt:{cmd:e(cmd)}}{cmd:ky_fit}{p_end}
{synopt:{cmd:e(cmdline)}}command as typed{p_end}
{synopt:{cmd:e(depvar)}}names of dependent variables, i.e. the administrative 
and survey log earnings variables, and the completely labelled group identifier{p_end}
{synopt:{cmd:e(user)}}name of likelihood-evaluator program. ky_ll_# with # = 
1, 2, ..., 8.{p_end}

{marker references}{...}
{title:References}

{pstd}Jenkins, S. P. and Rios-Avila, F. (2020). 
Measurement errors in survey and administrative data on employment earnings: 
sensitivity to the fraction assumed to have error-free earnings, 
{it:Economics Letters}, 192: 109253. {browse "https://doi.org/10.1016/j.econlet.2020.109253"}

{pstd}Jenkins, S. P. and Rios-Avila, F. (2021a). 
Measurement error in earnings data: replication of 
Meijer, Rohwedder, and Wansbeek’s mixture model
approach to combining survey and register data, 
{it: Journal of Applied Econometrics}, online first. 
{browse "https://doi.org/10.1002/jae.2811"}

{pstd}Jenkins, S. P. and Rios-Avila, F. (2021b). 
Finite mixture models for linked survey and administrative data: estimation and 
post-estimation. IZA Discussion Paper, forthcoming. 
{browse "https://www.iza.org/publications/dp"}
For submission to {it:The Stata Journal}.

{pstd}Jenkins, S. P. and Rios-Avila, F. (2021c). 
Reconciling reports: modelling employment earnings and measurement errors 
using linked survey and administrative data.
IZA Discussion Paper, forthcoming. {browse "https://www.iza.org/publications/dp"}

{pstd}Kapteyn, A. and Ypma, J. Y. (2007). Measurement error and misclassification: a 
comparison of survey and administrative data. 
{it: Journal of Labor Economics} 25 (3): 513{c -}551.
{browse "https://www.journals.uchicago.edu/doi/abs/10.1086/513298"}

{pstd}Meijer, E., Rohwedder, S. and Wansbeek, T. (2012). Measurement error in 
earnings data: using a mixture model approach to combine survey and register data. 
{it:Journal of Business & Economic Statistics} 30 (2): 191{c -}201.
{browse "https://www.tandfonline.com/doi/abs/10.1198/jbes.2011.08166"}


{marker authors}{...}
{title:Authors}


{pstd}
Stephen P. Jenkins {break}
Department of Social Policy{break}
London School of Economics and Political Science {break}
Houghton Street, London WC2A 2AE, UK{break}
Email: s.jenkins@lse.ac.uk

{pstd}
Fernando Rios-Avila{break}
Levy Economics Institute of Bard College{break}
Annandale-on-Hudson, NY 12504-5000, USA{break}
Email: friosavi@levy.org


{marker alsosee}{...}
{title:Also see}

{p 4 13 2}
{help ky_estat} if installed; {help ky_sim} if installed.