help tpm                                          also see: tpm postestimation
-------------------------------------------------------------------------------

Title

tpm -- Two-part models

Syntax

Same regressors in both parts

tpm depvar [indepvars] [if] [in] [weight] [, tpm_options]

Different regressors

tpm equation1 equation2 [if] [in] [weight] [, tpm_options]

where equation1 and equation2 are specified as

( [eqname: ] depvar [=] [indepvars] )

tpm_options               Description
-------------------------------------------------------------------------
Model
  firstpart(f_options)    specify the model for the first part
  secondpart(s_options)   specify the model for the second part

SE/Robust
  vce(vcetype)            vcetype may be conventional, robust, cluster clustvar, bootstrap, or jackknife
  robust                  synonym for vce(robust)
  cluster(clustvar)       synonym for vce(cluster clustvar)
  suest                   combine the estimation results of first and second part to derive a simultaneous (co)variance matrix of the sandwich/robust type

Reporting
  level(#)                set confidence level; default is level(95)
  nocnsreport             do not display constraints
  display_options         control spacing and display of omitted variables and base and empty cells
-------------------------------------------------------------------------
indepvars may contain factor variables; see fvvarlist.
depvar and indepvars may contain time-series operators; see tsvarlist.
bootstrap, by, jackknife, nestreg, rolling, statsby, stepwise, and svy are allowed; see prefix.
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
vce() and weights are not allowed with the svy prefix.
aweights, fweights, pweights, and iweights are allowed; see weight.
coeflegend does not appear in the dialog box.
See tpm_postestimation for features available after estimation.

f_options                    Description
-------------------------------------------------------------------------
Model
  logit [, logit_options]    specifies the model for the binary, first part outcome as a logistic regression
  probit [, probit_options]  specifies the model for the binary, first part outcome as a probit regression
-------------------------------------------------------------------------

s_options                     Description
-------------------------------------------------------------------------
Model
  glm [, glm_options]         specifies the model for the second part outcome as a generalized linear model
  regress [, regress_options] specifies the model for the continuous, second part outcome as a linear regression estimated using OLS
-------------------------------------------------------------------------

Description

tpm fits a two-part regression model of depvar on indepvars. The first part models the probability that depvar>0 using a binary choice model (logit or probit). The second part models the distribution of depvar | depvar>0 using either a linear regression (regress) or a generalized linear model (glm).
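As a rough illustration of what the two parts mean, the sketch below fits each part by hand on the womenwk data used in the Examples section. The indicator anywage is a hypothetical helper variable introduced here, and logit with a gamma, log-link glm is only one of the combinations tpm accepts. Under these assumptions the point estimates should correspond to those from tpm with firstpart(logit) and secondpart(glm, family(gamma) link(log)); the combined reporting and variance options of tpm are not reproduced.

```stata
* Setup, as in the Examples section
webuse womenwk, clear
replace wage = 0 if wage == .

* First part: model Pr(wage > 0) with a logit on an indicator
* (anywage is a hypothetical helper variable, not created by tpm)
generate byte anywage = wage > 0
logit anywage educ age married children

* Second part: model wage given wage > 0 with a gamma glm and log link
glm wage educ age married children if wage > 0, family(gamma) link(log)
```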

Options

    +-------+
----+ Model +------------------------------------------------------------

firstpart(string) specifies the first part of the model, for a binary outcome. This option is required and must be logit or probit. Either can be specified with its own options, except vce(), which should be specified as a tpm option.

secondpart(string) specifies the second part of the model, for a positive outcome. This option is required and must be regress or glm. Either can be specified with its own options, except vce(), which should be specified as a tpm option.

    +-----------+
----+ SE/Robust +--------------------------------------------------------

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup correlation, and that use bootstrap or jackknife methods; see [R] vce_option.

vce(conventional), the default, uses the conventionally derived variance estimators for first and second part models.

Note that options related to the variance estimators for both parts must be specified using vce(vcetype) in the tpm syntax. Specifying vce(cluster clustvar) implies robust standard errors that also allow for intragroup correlation.
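A minimal sketch of requesting cluster-robust standard errors for both parts at the tpm level. It uses the womenwk data from the Examples section and assumes that dataset contains a county identifier; the choice of clustering variable is purely illustrative.

```stata
webuse womenwk, clear
replace wage = 0 if wage == .

* vce() is given as a tpm option, not inside first() or second()
* (county is assumed to be a variable in womenwk)
tpm wage educ age married children, f(probit) s(glm, fam(gamma) link(log)) vce(cluster county)
```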

suest combines the estimation results of first and second part to derive a simultaneous (co)variance matrix of the sandwich/robust type. Typical applications of suest are tests for cross-part hypotheses using test or testnl.
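A hedged sketch of a cross-part test after fitting with the suest option. The equation names probit and glm used in the test command are assumptions; listing e(b) first shows the names actually used, and the test line should be adjusted to match.

```stata
webuse womenwk, clear
replace wage = 0 if wage == .

* Fit both parts and request the simultaneous (co)variance matrix
tpm wage educ age married children, f(probit) s(glm, fam(gamma) link(log)) suest

* Inspect the equation names attached to the coefficient vector
matrix list e(b)

* Joint test that educ has a zero coefficient in both parts
* (the equation names probit and glm are assumed, not confirmed)
test [probit]educ [glm]educ
```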

Options for the first part: logit

    +-------+
----+ Model +------------------------------------------------------------

noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options.

asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit.

    +--------------+
----+ Maximization +-----------------------------------------------------

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize. These options are seldom used.

Options for the first part: probit

    +-------+
----+ Model +------------------------------------------------------------

noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options.

asis specifies that all specified variables and observations be retained in the maximization process. This option is typically not specified and may introduce numerical instability. Normally probit drops variables that perfectly predict success or failure in the dependent variable along with their associated observations. In those cases, the effective coefficient on the dropped variables is infinity (negative infinity) for variables that completely determine a success (failure). Dropping the variable and perfectly predicted observations has no effect on the likelihood or estimates of the remaining coefficients and increases the numerical stability of the optimization process. Specifying this option forces retention of perfect predictor variables and their associated observations.

    +--------------+
----+ Maximization +-----------------------------------------------------

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize. These options are seldom used.

Options for the second part: glm

    +-------+
----+ Model +------------------------------------------------------------

family(familyname) specifies the distribution of depvar; family(gaussian) is the default.

link(linkname) specifies the link function; the default is the canonical link for the family() specified.

    +---------+
----+ Model 2 +----------------------------------------------------------

noconstant, exposure(varname), offset(varname), constraints(constraints), collinear; see [R] estimation options. constraints(constraints) and collinear are not allowed with irls.

mu(varname) specifies varname as the initial estimate for the mean of depvar. This option can be useful with models that experience convergence difficulties, such as family(binomial) models with power or odds-power links. init(varname) is a synonym.

disp(#) multiplies the variance of depvar by # and divides the deviance by #. The resulting distributions are members of the quasilikelihood family.

scale(x2|dev|#) overrides the default scale parameter. This option is allowed only with Hessian (information matrix) variance estimates.

By default, scale(1) is assumed for the discrete distributions (binomial, Poisson, and negative binomial), and scale(x2) is assumed for the continuous distributions (Gaussian, gamma, and inverse Gaussian).

scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized chi-squared) statistic divided by the residual degrees of freedom, which is recommended by McCullagh and Nelder (1989) as a good general choice for continuous distributions.

scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom. This option provides an alternative to scale(x2) for continuous distributions and overdispersed or underdispersed discrete distributions.

scale(#) sets the scale parameter to #. For example, using scale(1) in family(gamma) models results in exponential-errors regression. Additional use of link(log) rather than the default link(power -1) for family(gamma) essentially reproduces Stata's streg, dist(exp) nohr command (see [ST] streg) if all the observations are uncensored.

    +--------------+
----+ Maximization +-----------------------------------------------------

ml requests that optimization be carried out using Stata's ml commands and is the default.

irls requests iterated, reweighted least-squares (IRLS) optimization of the deviance instead of Newton-Raphson optimization of the log likelihood. If the irls option is not specified, the optimization is carried out using Stata's ml commands, in which case all options of ml maximize are also available.

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize. These options are seldom used.

Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).

fisher(#) specifies the number of Newton-Raphson steps that should use the Fisher scoring Hessian or EIM before switching to the observed information matrix (OIM). This option is useful only for Newton-Raphson optimization (and not when using irls).

search specifies that the command search for good starting values. This option is useful only for Newton-Raphson optimization (and not when using irls).

familyname      Description
-------------------------------------------------------------------------
gaussian        Gaussian (normal)
gamma           gamma
-------------------------------------------------------------------------

linkname        Description
-------------------------------------------------------------------------
identity        identity
log             log
power #         power
-------------------------------------------------------------------------

Options for the second part: regress

    +-------+
----+ Model +------------------------------------------------------------

log specifies that the linear regression be estimated on the logarithm of the second part, continuous outcome.

    +---------+
----+ Model 2 +----------------------------------------------------------

noconstant; see [R] estimation options.

    +-----------+
----+ Reporting +--------------------------------------------------------

level(#); see [R] estimation options.

nocnsreport; see [R] estimation options.

display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.

Remarks

tpm is designed to estimate models in which the positive outcome is continuous. It does not deal with discrete or count outcomes. It also does not allow boxcox or other models that may be appropriate for continuous outcomes.

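When the first syntax is used, tpm assumes that the list of covariates is the same for the index function in each part. Such a restriction is not required or appropriate for all two-part model applications; the second syntax allows a different list of regressors in each part.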
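The Syntax section lists an equation form for cases where the two parts should not share regressors. The sketch below follows that form on the womenwk data from the Examples section; the particular split of covariates between the parts is an arbitrary illustration, and the call has not been checked against a specific tpm version.

```stata
webuse womenwk, clear
replace wage = 0 if wage == .

* First equation: regressors for the binary first part
* Second equation: regressors for the positive-outcome second part
tpm (wage = educ age married children) (wage = educ age), f(probit) s(glm, fam(gamma) link(log))
```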

Examples

Setup
    . webuse womenwk, clear
    . replace wage = 0 if wage==.

Two part model with logit and glm with Gaussian family and identity link
    . tpm wage educ age married children, first(logit) second(glm)

Two part model with probit and glm with gamma family and log link
    . tpm wage educ age married children, f(probit) s(glm, fam(gamma) link(log))

Two part model with probit and linear regression
    . tpm wage educ age married children, f(probit) s(regress)

Two part model with probit and linear regression of log(depvar>0)
    . tpm wage educ age married children, f(probit) s(regress, log)
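For orientation, the second part of the last example can be approximated by hand as sketched below. The variable lnwage is a hypothetical helper name, and this sketch only mirrors the description of the log option (a linear regression on the logarithm of the positive outcome); it does not reproduce the combined tpm output.

```stata
webuse womenwk, clear
replace wage = 0 if wage == .

* Log of the outcome, defined only where the outcome is positive
* (lnwage is a hypothetical helper variable, not created by tpm)
generate double lnwage = ln(wage) if wage > 0

* OLS on the positive subsample; observations with wage == 0 drop out
* because lnwage is missing for them
regress lnwage educ age married children
```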

Saved results

If probit is specified as first part, tpm saves the following in e():

Scalars
  e(N_probit)                number of observations
  e(N_cds_probit)            number of completely determined successes
  e(N_cdf_probit)            number of completely determined failures
  e(k_probit)                number of parameters
  e(k_eq_probit)             number of equations in e(b)
  e(k_eq_model_probit)       number of equations in model (Wald test)
  e(k_dv_probit)             number of dependent variables
  e(k_autoCns_probit)        number of base, empty, and omitted constraints
  e(df_m_probit)             model degrees of freedom
  e(r2_p_probit)             pseudo-R-squared
  e(ll_probit)               log likelihood
  e(ll_0_probit)             log likelihood, constant-only model
  e(N_clust_probit)          number of clusters
  e(chi2_probit)             chi-squared
  e(p_probit)                significance
  e(rank_probit)             rank of e(V)
  e(ic_probit)               number of iterations
  e(rc_probit)               return code
  e(converged_probit)        1 if converged, 0 otherwise

Macros
  e(offset_probit)           offset
  e(chi2type_probit)         Wald or LR; type of model chi-squared test
  e(opt_probit)              type of optimization
  e(which_probit)            max or min; whether optimizer is to perform maximization or minimization
  e(ml_method_probit)        type of ml method
  e(user_probit)             name of likelihood-evaluator program
  e(technique_probit)        maximization technique
  e(singularHmethod_probit)  m-marquardt or hybrid; method used when Hessian is singular
  e(crittype_probit)         optimization criterion
  e(asbalanced_probit)       factor variables fvset as asbalanced
  e(asobserved_probit)       factor variables fvset as asobserved

If logit is specified as first part, tpm saves the following in e():

Scalars
  e(N_logit)                 number of observations
  e(N_cds_logit)             number of completely determined successes
  e(N_cdf_logit)             number of completely determined failures
  e(k_logit)                 number of parameters
  e(k_eq_logit)              number of equations in e(b)
  e(k_eq_model_logit)        number of equations in model (Wald test)
  e(k_dv_logit)              number of dependent variables
  e(k_autoCns_logit)         number of base, empty, and omitted constraints
  e(df_m_logit)              model degrees of freedom
  e(r2_p_logit)              pseudo-R-squared
  e(ll_logit)                log likelihood
  e(ll_0_logit)              log likelihood, constant-only model
  e(N_clust_logit)           number of clusters
  e(chi2_logit)              chi-squared
  e(p_logit)                 significance
  e(rank_logit)              rank of e(V)
  e(ic_logit)                number of iterations
  e(rc_logit)                return code
  e(converged_logit)         1 if converged, 0 otherwise

Macros
  e(offset_logit)            offset
  e(chi2type_logit)          Wald or LR; type of model chi-squared test
  e(opt_logit)               type of optimization
  e(which_logit)             max or min; whether optimizer is to perform maximization or minimization
  e(ml_method_logit)         type of ml method
  e(user_logit)              name of likelihood-evaluator program
  e(technique_logit)         maximization technique
  e(singularHmethod_logit)   m-marquardt or hybrid; method used when Hessian is singular
  e(crittype_logit)          optimization criterion
  e(asbalanced_logit)        factor variables fvset as asbalanced
  e(asobserved_logit)        factor variables fvset as asobserved

If glm is specified as second part, tpm saves the following in e():

Scalars
  e(N_glm)                   number of observations
  e(k_glm)                   number of parameters
  e(k_eq_glm)                number of equations in e(b)
  e(k_eq_model_glm)          number of equations in model (Wald test)
  e(k_dv_glm)                number of dependent variables
  e(k_autoCns_glm)           number of base, empty, and omitted constraints
  e(df_m_glm)                model degrees of freedom
  e(df_glm)                  residual degrees of freedom
  e(phi_glm)                 scale parameter
  e(aic_glm)                 model AIC
  e(bic_glm)                 model BIC
  e(ll_glm)                  log likelihood, if NR
  e(N_clust_glm)             number of clusters
  e(chi2_glm)                chi-squared
  e(p_glm)                   significance
  e(deviance_glm)            deviance
  e(deviance_s_glm)          scaled deviance
  e(deviance_p_glm)          Pearson deviance
  e(deviance_ps_glm)         scaled Pearson deviance
  e(dispers_glm)             dispersion
  e(dispers_s_glm)           scaled dispersion
  e(dispers_p_glm)           Pearson dispersion
  e(dispers_ps_glm)          scaled Pearson dispersion
  e(nbml_glm)                1 if negative binomial parameter estimated via ML, 0 otherwise
  e(vf_glm)                  factor set by vfactor(), 1 if not set
  e(power_glm)               power set by power(), opower()
  e(rank_glm)                rank of e(V)
  e(ic_glm)                  number of iterations
  e(rc_glm)                  return code
  e(converged_glm)           1 if converged, 0 otherwise

Macros
  e(varfunc_glm)             name of variance function used
  e(varfunct_glm)            Gaussian, Inverse Gaussian, Binomial, Poisson, Neg. Binomial, Bernoulli, Power, or Gamma
  e(varfuncf_glm)            variance function
  e(link_glm)                name of link function used
  e(linkt_glm)               link title
  e(linkf_glm)               link form
  e(m_glm)                   number of binomial trials
  e(offset_glm)              offset
  e(chi2type_glm)            Wald or LR; type of model chi-squared test
  e(cons_glm)                set if noconstant specified
  e(hac_kernel_glm)          HAC kernel
  e(hac_lag_glm)             HAC lag
  e(opt_glm)                 ml or irls
  e(opt1_glm)                optimization title, line 1
  e(opt2_glm)                optimization title, line 2
  e(which_glm)               max or min; whether optimizer is to perform maximization or minimization
  e(ml_method_glm)           type of ml method
  e(user_glm)                name of likelihood-evaluator program
  e(technique_glm)           maximization technique
  e(singularHmethod_glm)     m-marquardt or hybrid; method used when Hessian is singular
  e(crittype_glm)            optimization criterion
  e(asbalanced_glm)          factor variables fvset as asbalanced
  e(asobserved_glm)          factor variables fvset as asobserved

tpm saves the following in e():

Macros
  e(cmd)                     tpm
  e(cmdline)                 command as typed
  e(depvar)                  name of dependent variable
  e(wtype)                   weight type
  e(wexp)                    weight expression
  e(title)                   title in estimation output
  e(clustvar)                name of cluster variable
  e(vce)                     vcetype specified in vce()
  e(vcetype)                 title used to label Std. Err.
  e(properties)              b V
  e(estat_cmd)               program used to implement estat
  e(predict)                 program used to implement predict

Matrices
  e(b)                       coefficient vector
  e(gradient)                gradient vector
  e(V)                       variance-covariance matrix of the estimators
  e(V_modelbased)            model-based variance

Functions
  e(sample)                  marks estimation sample (first part)
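A short sketch of how the saved results listed above might be inspected after estimation. It assumes the womenwk setup from the Examples section and a probit first part with a glm second part, so the _probit and _glm suffixes apply; with other choices the suffixes change accordingly.

```stata
webuse womenwk, clear
replace wage = 0 if wage == .
tpm wage educ age married children, f(probit) s(glm, fam(gamma) link(log))

* List everything left behind in e()
ereturn list

* Observations used in the first and second parts
display e(N_probit)
display e(N_glm)

* Command as typed, and the coefficient vector for both parts
display "`e(cmdline)'"
matrix list e(b)

* Count the observations in the estimation sample (first part)
count if e(sample)
```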