------------------------------------------------------------------------------- help forpgmhaz8Stephen P. Jenkins (September 2004) -------------------------------------------------------------------------------

Discrete time (grouped data) proportional hazards models

pgmhaz8covariates[weight] [ifexp] [inrange] [,id(idvar)dead(deadvar)seq(seqvar)lnvar0(#)eformnoconsnolognobeta0level(#)maximize_options]

by...:may be used withpgmhaz8; see help by.

fweights andiweights are allowed; see help weights.

Description

pgmhaz8estimates by ML two discrete time (grouped data) proportional hazards regression models, one of which incorporates a gamma mixture distribution to summarize unobserved individual heterogeneity (or 'frailty'). Covariates may include regressor variables summarizing observed differences between persons (either fixed or time-varying), and variables summarizing the duration dependence of the hazard rate. With suitable definition of covariates, models with a fully non-parametric specification for duration dependence may be estimated; so too may parametric specifications.pgmhaz8thus provides a useful complement tostcox(for continuous survival time data), and related programs. For further discussion of hazard regression models with unobserved heterogeneity, see e.g. http://www.iser.essex.ac.uk/teaching/stephenj/ec968/index.php, espcially `Lesson 8'. The Lesson also shows how to derive predicted hazard and survivor functions from the estimates of the model.

pgmhaz8estimates two models by maximum likelihood: (1) the Prentice-Gloeckler (1978) model; and (2) the Prentice-Gloeckler (1978) model incorporating a gamma mixture distribution to summarize unobserved individual heterogeneity, as proposed by Meyer (1990). Specifically, the Prentice-Gloeckler-Meyer models estimated are those described by Meyer (1990, equations 5 and 7).Suppose there are individuals

i= 1,...,N, who each enter a state (e.g. unemployment) at timet= 0 and are observed fork_itime periods, at which point each person either remains in the state (censored duration data) or leaves the state (complete duration data).The log-likelihood function for Model (2) is:

_ _ i=N | _ k_i-1 _ _ k_i _ | --- | | --- |-v^-1 | --- |-v^-1 | \ | | \ | | \ | | / log| |1+ v./ exp(I_it)| - c_i.|1+ v./ exp(I_it)| | --- | | --- | | --- | | i=1 | |_ t=0 _| |_ t=0 _| | |_ _|

c_iis a censoring indicator (= 1 if event, 0 otherwise);I_itis an index function,X_it*b, incorporating the impact of covariatesX_it; andvis the variance of the gamma mixing distribution (the mean of which is normalized to equal one). Model 1 is the limiting case asv-> 0.For suitably re-organised data, the log-likelihood function for Model (1) is the same as the log-likelihood for a generalized linear model of the binomial family with complementary log-log link: see Allison (1982) or Jenkins (1995). Model (1) is estimated using the glm command (cloglog would have produced the same estimates). To estimate the discrete time proportional hazards model with normally (rather than Gamma) distributed heterogeneity, use xtcloglog with the data organized as here.

Model (2) is estimated using

ml d0, with starting values for the coefficientsbtaken from Model (1)'s estimates. The program estimates the log of the Gamma variance, with default value equal to -1, i.e. a variance of c. 0.37). Different starting values for the log variance may be set optionally: see below.

Important note about data organization and variablesThe data set must be organised beforehand so that there is a data row corresponding to each time period at risk of the event for each subject. (This corresponds to `episode-splitting' at each each and every period.) expand is useful for putting the data in this form. Also see the `data step' discussion in Jenkins (1995).

OptionsThe first three options are required:

id(idvar)specifies the variable uniquely identifying each subject,i.

seq(seqvar)is the variable uniquely identifying each time period at risk for each subject. For eachi, the variable is the integer sequence 1, 2, ...,k_i.

dead(deadvar)summarizes censoring status during each time period at risk. Ifc_i= 0, deadvar = 0 for allt= 1, 2, ...,k_i; ifc_i= 1,deadvar= 0 for allt= 1,2,...,(k_i)-1, anddeadvar= 1 fort=k_i.

lnvar0(#)specifies the value for the log of the Gamma variance that is used as the starting value in the maximization. The default is -1.

eformreports the coefficients transformed to hazard ratio format, i.e. exp(b) rather than b. Standard errors and confidence intervals are similarly transformed.eformmay be specified at estimation or when redisplaying results.

level(#)specifies the significance level, in percent, for confidence intervals of the parameters; see help level.

noconsspecifies no intercept term in the index function,X_it*b.

nologsuppresses the iteration logs.

nobeta0suppresses reporting of the estimates from Model (1).

maximize_optionscontrol the maximization process. The options available are those available forml d0as shown by help maximize, with the exception offrom(). For difficult maximization problems, using thedifficultortechniqueoptions may help convergence. So too might different starting values for the Gamma variance parameter.

Saved resultsIn addition to the usual results saved after

ml,pgmhaz8also saves the following:

e(gammav)ande(se_gammav)are the estimated Gamma variance and its standard error.

e(ll_nofr)is the log-likelihood value from Model (1).

e(lltest)is the likelihood ratio test statistic for the test of Model (1) versus Model (2), i.e. for the hypothesis that the Gamma variance is equal to zero, ande(lltest_p)is the associatedp-value.

e(depvar),e(deadvar), ande(seqvar)contain the names ofdepvar,deadvar, andseqvar.

e(N_spell)is the number of spells in the data. (Cf.e(N), the total number of `person-period' observations in the estimation data set.)

Examples. sysuse cancer, clear

. ge id = _n

. recode drug 1=0 2/3=1

. expand studytime

. bysort id: ge t = _n // 'survival time', t, is # periods at risk per subject

. bysort id: ge dead = died & _n==_N // event indicator

. // duration dependence: log(time)

. ge logt = ln(t)

. pgmhaz drug age logt, id(id) seq(t) d(dead) nolog

. pgmhaz, eform

. // duration dependence: piece-wise constant

. ge byte dur1 = d1+d2+d3+d4+d5+d6

. ge byte dur2 = d7+d8+d9+d10+d11+d12

. ge byte dur3 = d13+d14+d15+d16+d17+d18

. ge byte dur4 = d19+d20+d21+d22+d23+d24

. ge byte dur5 = d25+d26+d27+d28+d29+d30

. ge byte dur6 = d31+d32+d33+d34+d35+d36+d37+d38+d39

. pgmhaz8 drug age dur1-dur6, i(id) s(t) d(dead) nocons

. // duration dependence: non-parametric (period-specific)

. ta t, ge(d) // Create period-specific dummy variables, d*

. ta t dead // Check whether events in each period; only use periods with events

. egen include = eqany(d1-d8 d10-d13 d15-d17 d22-d25 d28 d33), v(1)

. pgmhaz8 drug age d1-d8 d10-d13 d15-d17 d22-d25 d28 d33 ///

if include, i(id) s(t) d(dead) nocons

AuthorStephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

AcknowledgementsAdrienne tenCate found a bug in the handling of fweighted data, and Jeff Pitblado showed me how to fix it.

ReferencesAllison, P.D. (1982). Discrete-time methods for the analysis of event histories. In

Sociological Methodology 1982, ed. S. Leinhardt, San Francisco: Jossey-Bass Publishers, 61-97.Jenkins, S.P. (1995). Easy estimation methods for discrete-time duration models.

Oxford Bulletin of Economics and Statistics57 (1): 129-138.Meyer, B.D. (1990). Unemployment insurance and unemployment spells.

Econometrica58 (4): 757-782.Prentice, R. and Gloeckler L. (1978). Regression analysis of grouped survival data with application to breast cancer data.

Biometrics34: 57-67.

Also seeHelp for stcox, glm, cloglog, xtcloglog, expand, and hshaz if installed. Users of Stata versions 5 to 8.1 may estimate Models (1) and (2) using

pgmhaz, published in STB-39. Install it usingnet stb 39sbe17.