-------------------------------------------------------------------------------
help for pgmhaz8                            Stephen P. Jenkins (September 2004)
-------------------------------------------------------------------------------

Discrete time (grouped data) proportional hazards models

        pgmhaz8 covariates [weight] [if exp] [in range] [, id(idvar)
                 dead(deadvar) seq(seqvar) lnvar0(#) eform nocons nolog
                 nobeta0 level(#) maximize_options]

    by ... : may be used with pgmhaz8; see help by.

    fweights and iweights are allowed; see help weights.


Description

    pgmhaz8 estimates by ML two discrete time (grouped data) proportional
    hazards regression models, one of which incorporates a gamma mixture
    distribution to summarize unobserved individual heterogeneity (or
    'frailty').  Covariates may include regressor variables summarizing
    observed differences between persons (either fixed or time-varying), and
    variables summarizing the duration dependence of the hazard rate.  With
    suitable definition of covariates, models with a fully non-parametric
    specification for duration dependence may be estimated; so too may
    parametric specifications.  pgmhaz8 thus provides a useful complement to
    stcox (for continuous survival time data), and related programs.  For
    further discussion of hazard regression models with unobserved
    heterogeneity, see e.g. 
    http://www.iser.essex.ac.uk/teaching/stephenj/ec968/index.php, espcially
    `Lesson 8'. The Lesson also shows how to derive predicted hazard and
    survivor functions from the estimates of the model.


    pgmhaz8 estimates two models by maximum likelihood:  (1) the
    Prentice-Gloeckler (1978) model; and (2) the Prentice-Gloeckler (1978)
    model incorporating a gamma mixture distribution to summarize unobserved
    individual heterogeneity, as proposed by Meyer (1990).  Specifically, the
    Prentice-Gloeckler-Meyer models estimated are those described by Meyer
    (1990, equations 5 and 7).

    Suppose there are individuals i = 1,...,N, who each enter a state (e.g.
    unemployment) at time t = 0 and are observed for k_i time periods, at
    which point each person either remains in the state (censored duration
    data) or leaves the state (complete duration data).

    The log-likelihood function for Model (2) is:

            _                                                       _
     i=N   |   _   k_i-1       _             _    k_i        _       |
     ---   |  |     ---         |-v^-1      |     ---         |-v^-1 |
     \     |  |     \           |           |     \           |      |
     /  log|  |1+ v./  exp(I_it)|     - c_i.|1+ v./  exp(I_it)|      |
     ---   |  |     ---         |           |     ---         |      |
     i=1   |  |_    t=0        _|           |_    t=0        _|      |
           |_                                                       _|


    c_i is a censoring indicator (= 1 if event, 0 otherwise); I_it is an
    index function, X_it*b, incorporating the impact of covariates X_it; and
    v is the variance of the gamma mixing distribution (the mean of which is
    normalized to equal one).  Model 1 is the limiting case as v -> 0.

    For suitably re-organised data, the log-likelihood function for Model (1)
    is the same as the log-likelihood for a generalized linear model of the
    binomial family with complementary log-log link: see Allison (1982) or
    Jenkins (1995).  Model (1) is estimated using the glm command (cloglog
    would have produced the same estimates).  To estimate the discrete time
    proportional hazards model with normally (rather than Gamma) distributed
    heterogeneity, use xtcloglog with the data organized as here.

    Model (2) is estimated using ml d0, with starting values for the
    coefficients b taken from Model (1)'s estimates. The program estimates
    the log of the Gamma variance, with default value equal to -1, i.e. a
    variance of c. 0.37). Different starting values for the log variance may
    be set optionally: see below.

Important note about data organization and variables

    The data set must be organised beforehand so that there is a data row
    corresponding to each time period at risk of the event for each subject.
    (This corresponds to `episode-splitting' at each each and every period.) 
    expand is useful for putting the data in this form.  Also see the `data
    step' discussion in Jenkins (1995).

Options

    The first three options are required:

    id(idvar) specifies the variable uniquely identifying each subject, i.

    seq(seqvar) is the variable uniquely identifying each time period at risk
        for each subject. For each i, the variable is the integer sequence 1,
        2, ..., k_i.

    dead(deadvar) summarizes censoring status during each time period at
        risk.  If c_i = 0, deadvar = 0 for all t = 1, 2, ..., k_i; if c_i =
        1, deadvar = 0 for all t = 1,2,...,(k_i)-1, and deadvar = 1 for t =
        k_i.

    lnvar0(#) specifies the value for the log of the Gamma variance that is
        used as the starting value in the maximization. The default is -1.

    eform reports the coefficients transformed to hazard ratio format, i.e.
        exp(b) rather than b. Standard errors and confidence intervals are
        similarly transformed. eform may be specified at estimation or when
        redisplaying results.

    level(#) specifies the significance level, in percent, for confidence
        intervals of the parameters; see help level.

    nocons specifies no intercept term in the index function, X_it*b.

    nolog suppresses the iteration logs.

    nobeta0 suppresses reporting of the estimates from Model (1).

    maximize_options control the maximization process. The options available
        are those available for ml d0 as shown by help maximize, with the
        exception of from(). For difficult maximization problems, using the
        difficult or technique options may help convergence.  So too might
        different starting values for the Gamma variance parameter.


Saved results

    In addition to the usual results saved after ml, pgmhaz8 also saves the
    following:

    e(gammav) and e(se_gammav) are the estimated Gamma variance and its
    standard error.

    e(ll_nofr) is the log-likelihood value from Model (1).

    e(lltest) is the likelihood ratio test statistic for the test of Model
    (1) versus Model (2), i.e. for the hypothesis that the Gamma variance is
    equal to zero, and e(lltest_p) is the associated p-value.

    e(depvar), e(deadvar), and e(seqvar) contain the names of depvar,
    deadvar, and seqvar.

    e(N_spell) is the number of spells in the data. (Cf. e(N), the total
    number of `person-period' observations in the estimation data set.)


Examples

    . sysuse cancer, clear

    . ge id = _n

    . recode drug 1=0 2/3=1

    . expand studytime

    . bysort id: ge t = _n // 'survival time', t, is # periods at risk per
        subject

    . bysort id: ge dead = died & _n==_N // event indicator

    . // duration dependence: log(time)

    . ge logt = ln(t)

    . pgmhaz drug age logt, id(id) seq(t) d(dead) nolog

    . pgmhaz, eform

    . // duration dependence: piece-wise constant

    . ge byte dur1 = d1+d2+d3+d4+d5+d6

    . ge byte dur2 = d7+d8+d9+d10+d11+d12

    . ge byte dur3 = d13+d14+d15+d16+d17+d18

    . ge byte dur4 = d19+d20+d21+d22+d23+d24

    . ge byte dur5 = d25+d26+d27+d28+d29+d30

    . ge byte dur6 = d31+d32+d33+d34+d35+d36+d37+d38+d39

    . pgmhaz8 drug age dur1-dur6, i(id) s(t) d(dead) nocons

    . // duration dependence: non-parametric (period-specific)

    . ta t, ge(d) // Create period-specific dummy variables, d*

    . ta t dead // Check whether events in each period; only use periods with
        events

    . egen include = eqany(d1-d8 d10-d13 d15-d17 d22-d25 d28 d33), v(1)

    . pgmhaz8 drug age d1-d8 d10-d13 d15-d17 d22-d25 d28 d33 ///

            if include, i(id) s(t) d(dead) nocons


Author

    Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and
    Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

Acknowledgements

    Adrienne tenCate found a bug in the handling of fweighted data, and Jeff
    Pitblado showed me how to fix it.


References

    Allison, P.D. (1982). Discrete-time methods for the analysis of event
        histories. In Sociological Methodology 1982, ed. S. Leinhardt, San
        Francisco: Jossey-Bass Publishers, 61-97.

    Jenkins, S.P. (1995). Easy estimation methods for discrete-time duration
        models. Oxford Bulletin of Economics and Statistics 57 (1): 129-138.

    Meyer, B.D. (1990). Unemployment insurance and unemployment spells.
        Econometrica 58 (4): 757-782.

    Prentice, R. and Gloeckler L. (1978). Regression analysis of grouped
        survival data with application to breast cancer data.  Biometrics 34:
        57-67.


Also see

    Help for stcox, glm, cloglog, xtcloglog, expand, and hshaz if installed.
             Users of Stata versions 5 to 8.1 may estimate Models (1) and (2)
             using pgmhaz, published in STB-39. Install it using net stb 39
             sbe17.