-------------------------------------------------------------------------------
help for hshaz                               Stephen P. Jenkins (December 2005)
-------------------------------------------------------------------------------

Discrete time (grouped data) proportional hazards models

        hshaz covariates [weight] [if exp] [in range] [, id(idvar)
                 dead(deadvar) seq(seqvar) nmp(#) m2(#) p2(#) m3(#) p3(#)
                 m4(#) p4(#) m5(#) p5(#) eform nocons nolog nobeta0 level(#)
                 maximize_options]

    by ... : may be used with hshaz; see help by.

    fweights and iweights are allowed; see help weights.


Description

    hshaz estimates, by ML, two discrete time (grouped data) proportional
    hazards regression models, one of which incorporates a discrete mixture
    distribution to summarize unobserved individual heterogeneity (or
    'frailty').  Covariates may include regressor variables summarizing
    observed differences between persons (either fixed or time-varying), and
    variables summarizing the duration dependence of the hazard rate.  With
    suitable definition of covariates, models with a fully non-parametric
    specification for duration dependence may be estimated; so too may
    parametric specifications.  hshaz thus provides a useful complement to
    stcox (for continuous survival time data), and related programs.  For
    further discussion of hazard regression models with unobserved
    heterogeneity, see e.g. 
    http://www.iser.essex.ac.uk/teaching/stephenj/ec968/index.php, espcially
    `Lesson 7'. The Lesson also shows how to derive predicted hazard and
    survivor functions from the estimates of the model.

    hshaz estimates two models by maximum likelihood:  (1) the
    Prentice-Gloeckler (1978) model; and (2) the Prentice-Gloeckler (1978)
    model incorporating a discrete mixture distribution to summarize
    unobserved individual heterogeneity, as proposed by Heckman and Singer
    (1984).  (To estimate a model incorporating Gamma distributed
    heterogeneity, use pgmhaz8.)

    Suppose there are individuals i = 1,...,N, who each enter a state (e.g.
    unemployment) at time t = 0 and are observed for j time periods, at which
    point each person either remains in the state (censored duration data) or
    leaves the state (complete duration data).

    In Model 1, the discrete hazard rate in period t is

                         h_t = 1-exp(-exp(b0 + X_it*b))

    where b0 is an intercept and the linear index function, X_it*b,
    incorporates the impact of covariates X_it. The contribution to the
    sample likelihood for a subject with a spell length of j periods is

                           S(j) * (h_j/(1 - h_j))^c 

    where S(j) is the probability of remaining in the state j periods, i.e.
    the survivor function, and c is a censoring indicator, equal to one for a
    completed spell and zero otherwise.

    Suppose now that each individual belongs to one of a number of different
    types, and membership of each class is unobserved. This is parameterized
    by allowing the intercept term in the hazard function to differ across
    types. Thus, for a model with types z = 1,...,Z, the hazard function for
    an individual belonging to type z is:

                     hz_t = 1-exp(-exp(m_z + b0 + X_it*b))

    and the probability of belonging to type z is p_z. The m_z characterize
    the discrete points of support of a multinomial distribution (`mass
    points'), with m_1 normalized to equal zero and p_1 = 1 - SUM(z=2 to
    z=Z)[ p_z ]. The zth mass point equals m_z + b0.

    In hshaz, the default number of mass points is 2, but optionally may be
    set equal to 3, 4, or 5.

    The contribution to the sample likelihood of a subject with observed
    duration j is:

          L = SUM(z=1 to z=Z) [ p_z * Sz(j) * (hz_j/(1 - hz_j))^c ]  

    For suitably re-organised data, the log-likelihood function for Model 1
    is the same as the log-likelihood for a generalized linear model of the
    binomial family with complementary log-log link: see Allison (1982) or
    Jenkins (1995).  Model 1 is estimated using the glm command (cloglog
    produces the same estimates).

    Model 2 is estimated using ml d0. Heckman and Singer (1984) note
    potential problems with finding the global maxima rather multiple local
    maxima, and emphasize the benefits of estimating models with alternative
    sets of starting values to check convergence. Different starting values
    for the mass points and associated probabilities may be set optionally:
    see below. See also the discussion of maximize_options below.


Important note about data organization and variables

    The data set must be organised beforehand so that there is a data row
    corresponding to each time period at risk of the event for each subject.
    (This corresponds to `episode splitting' at each and every period.) 
    expand is useful for putting the data in this form.  Also see the `data
    step' discussion in Jenkins (1995).

Options

    The first three options are required:

    id(idvar) specifies the variable uniquely identifying each subject, i.

    seq(seqvar) is the variable uniquely identifying each time period at risk
        for each subject. For each i at risk j periods, the variable is the
        integer sequence 1, 2, ..., j_i.

    dead(deadvar) summarizes censoring status during each time period at
        risk.  If c_i = 0, deadvar = 0 for all t = 1, 2, ..., j_i; if c_i =
        1, deadvar = 0 for all t = 1,2,...,(j_i)-1, and deadvar = 1 for t =
        j_i.

    nmp(#) specifies the number of discrete points of support Z.  The default
        is two.

    m2(#), m3(#), m4(#), m5(#), specify starting values for m_2, m_3, m_4,
        m_5. The default values are 1, -1, 0.1, -0.1, respectively.

    p2(#), p3(#), p4(#), p5(#), specify starting values for the probabilities
        of belonging to types 2, 3, 4, and 5.  The default values are 0.3,
        0.3, 0.1, 0.1, respectively.

    eform reports the coefficients transformed to hazard ratio format, i.e.
        exp(b) rather than b. Standard errors and confidence intervals are
        similarly transformed. eform may be specified at estimation or when
        redisplaying results.

    level(#) specifies the significance level, in percent, for confidence
        intervals of the parameters; see help level.

    nocons specifies no intercept term, i.e. b0 = 0.

    nolog suppresses the iteration logs.

    nobeta0 suppresses reporting of the estimates from Model 1.

    maximize_options control the maximization process. The options available
        are those available for ml d0 as shown by help maximize, with the
        exception of from().  For difficult maximization problems, using the
        difficult or technique options may help convergence.  So too might
        different starting values for the frailty parameters.


Saved results

    In addition to the usual results saved after ml, hshaz also saves the
    following:

    e(m2) and e(se_m2), e(m3) and e(se_m3), e(m4) and e(se_m4), e(m5) and
    e(se_m5), are the estimated m_z and their corresponding standard errors.
    e(m1) = 0.

    e(pr1) and e(se_pr1), e(pr2) and e(se_pr2), e(pr3) and e(se_pr3), e(pr4)
    and e(se_pr4), e(pr5) and e(se_pr5), are the estimated probabilities of
    each type and their corresponding standard errors. e(pr1) = 1 - SUM(z=2
    to z=Z)[e(prz)].

    e(ll_nofr) is the log-likelihood value from Model 1.

    e(depvar), e(deadvar), and e(seqvar) contain the names of depvar,
    deadvar, and seqvar.

    e(N_spell) is the number of spells in the data. (Cf. e(N), the total
    number of `person-period' observations in the estimation data set.)


Examples

    . sysuse cancer, clear

    . ge id = _n

    . recode drug 1=0 2/3=1

    . expand studytime

    . bysort id: ge t = _n // 'survival time', t, is # periods at risk per
        subject

    . bysort id: ge dead = died & _n==_N // event indicator

    . // duration dependence: log(time)

    . ge logt = ln(t)

    . // two mass points

    . hshaz drug age logt, id(id) seq(t) d(dead)

    . hshaz, eform

    . // equivalent model estimated using -gllamm-

    . gllamm dead age drug logt , i(id) ip(fn) nip(2) f(bin) l(cll) allc
        nocons

    . // three mass points

    . hshaz drug age logt, id(id) seq(t) d(dead) nmp(3) difficult tech(dfp)


Author

    Stephen P. Jenkins <stephenj@essex.ac.uk>, Institute for Social and
    Economic Research, University of Essex, Colchester CO4 3SQ, U.K.

Acknowledgements

    Adrienne tenCate found a bug in the handling of fweighted data, and Jeff
    Pitblado showed me how to fix it. Arne Uhlendorff and Peter Haan found a
    bug in the calculation of the SEs for the estimated probabilities of the
    various Types.

References

    Allison, P.D. (1982). Discrete-time methods for the analysis of event
        histories. In Sociological Methodology 1982, ed. S. Leinhardt, San
        Francisco: Jossey-Bass Publishers, 61-97.

    Heckman, J.J. and Singer, B. (1984). A Method for minimizing the impact
        of distributional assumptions in econometric models for duration
        data, Econometrica, 52 (2): 271-320.

    Jenkins, S.P. (1995). Easy estimation methods for discrete-time duration
        models. Oxford Bulletin of Economics and Statistics 57 (1): 129-138.

    Prentice, R. and Gloeckler L. (1978). Regression analysis of grouped
        survival data with application to breast cancer data.  Biometrics 34
        (1): 57-67.


Also see

    Help for stcox, glm, cloglog, xtcloglog, expand, and, if installed, 
             pgmhaz, pgmhaz8.