.-
help for ^spsurv^
.-

Split population survival ('cure') model (discrete time duration data)
----------------------------------------------------------------------

^spsurv^ depvar varlist [^if^ <exp>] [^in^ <range>]
        , ^i^d^(^idvar^)^ ^s^eq^(^seqvar^)^  [^nocons^] 
        [^clog^log ^cpr^0^(^#^)^ ^ef^orm ^le^vel^(^#^)^ ^mlopts^]

To reset problem-size limits, see help @matsize@.
^spsurv^ works with Stata version 6 or version 7.

Description
-----------

^spsurv^ estimates what economists refer to as split population survival
models (Schmidt and Witte, 1989) and biostatisticians refer to as cure
models (Maller and Zhou, 1996), for the case where survival times are
intrinsically discrete or are recorded in grouped form, e.g. in months.
(Cf. the continuous time lognormal cure model ^lncure^.) Standard
survival models assume that the prob(eventual failure) > 0 for all
individuals. By contrast, split population models suppose that a
proportion of the population never fails. ^spsurv^ estimates this
proportion by ML, together with the parameters characterising the hazard
rate for the remainder of the population. For the latter, ^spsurv^
estimates a discrete-time proportional hazard (cloglog) model. The split
population model is a form of 'mover-stayer' model.

Covariates in varlist may include regressors (fixed or time-varying)
and variables summarizing the duration dependence of the hazard rate.  
With suitable definition of the latter, models with a fully non-parametric 
specification for duration dependence may be estimated; so too may 
parametric specifications. 

Let F be an indicator of whether a case eventually fails or not, where
F=1 means eventual failure (not 'cured'), and F=0 means never fail 
(i.e. the event of interest never occurs = 'cured'). Put another way,
let prob(F=1) = 1-c  ('recidivist' probability) and prob(F=0) = c ('cure' 
probability). ^spsurv^ estimates the cure probability, c, rather than 
1-c, the recidivist probability.

For those with failure observed during a given time interval, 
the contribution to the likelihood is (1-c)*(probability of survival 
to end of previous time interval)*(probability of the event in the given 
interval). Censored observations consist of those 'cured' plus the non-cured
not yet observed to fail. Hence the contribution to the likelihood from a 
censored survival time is c plus (1-c)*(probability of survival to end of the 
given time interval). 

More precisely, the (log)likelihood contribution for person i with a 
survival time of t 'months' is:

        lnL_i = d_i.ln[(1-c).(h_it).(S_i,t-1)] + (1-d_i).ln[c + (1-c).(S_it)]

where the discrete-time survivor function is

        S_it = PRODUCT(j=1 to j=t) { 1-h_ij },  with S_i0 = 1,

and d_i is a censoring indicator (=1 if failure observed, 0 otherwise).

The discrete-time hazard h_it is assumed to take the cloglog form: 

        h_it = 1 - exp[-exp(I_it)]  
where 
        I_it = f(t) + b'X_it.

Covariates X_it may be time-varying. The f(t) summarises duration
dependence in the hazard common to each i. An example is f(t) =
log(t) for a discrete time analogue to a continuous-time Weibull
model. A non-parametric baseline hazard could be fit using an
appropriately-defined set of dummy variables, one for each interval
at risk of failure. (This is perhaps the only time that one might 
want to use the ^nocons^ option.)
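
As a minimal sketch of the non-parametric specification (assuming the
variables created in the Examples section below; the ^dur^ stub is
hypothetical), one dummy variable per interval can be generated with
@tabulate@ and entered in place of the constant:

. ^quietly tab t, gen(dur)^
. ^spsurv dead dur* drug age, id(id) seq(t) nocons^

Intervals in which no failures are observed would typically need to be
grouped with adjacent intervals before estimation, because the
corresponding dummy coefficients are otherwise not identified.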

If c = 0, a testable hypothesis, the split population survival model
reduces to the standard discrete-time proportional hazards survival model. 
This could be estimated using the command
        ^cloglog^ depvar varlist [^if^ <exp>] [^in^ <range>] , ^options^
applied to data organised exactly as they are for the corresponding ^spsurv^
command. This model is used to derive starting values in ^spsurv^, and its
estimates may be displayed using the ^cloglog^ option.
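
For illustration, assuming the variables created in the Examples section
below, the c=0 model can be fitted directly and compared with the
estimates that ^spsurv^ displays when the ^cloglog^ option is specified:

. ^cloglog dead logt drug age^
. ^spsurv dead logt drug age, id(id) seq(t) cloglog^

The log likelihood from the first command should correspond to the value
that ^spsurv^ saves in e(ll_noc).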

The likelihood ratio test of whether c=0 is implemented as a boundary-value 
test, as described by Gutierrez, Carter and Drukker (2001). (See also Maller
and Zhou, 1996.)  Where c is so small as to be indistinguishable from zero
(taken here to mean c < 1e-05), the test statistic is set equal to zero and a 
p-value of 1 is reported.
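
As a sketch of the calculation, assuming that ^spsurv^ has just been run
and that e(ll) is the maximised log likelihood saved by ^ml^, the test
statistic and its boundary-adjusted p-value might be reproduced by hand
(for the case where c is not indistinguishable from zero):

. ^display 2*(e(ll) - e(ll_noc))^
. ^display 0.5*chi2tail(1, e(chi2_c))^

The factor 0.5 reflects the 50:50 mixture of chi-sq(0) and chi-sq(1)
under H0. The function chi2tail() is assumed to be available (Stata 7);
under Stata 6 the older chiprob() function plays the same role.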

In principle, prob(cure) could differ between individuals, rather than being
assumed fixed and common to all individuals as here. One obvious
parameterisation would be to suppose a logistic relationship between
characteristics and the cure probability, i.e.
c_i = 1/[1 + exp(-_cons - q'X_i)], rather than assuming q=0, as here. In
practice, models allowing for such heterogeneity are difficult to fit.


Important note about data organization and mandatory variables
--------------------------------------------------------------
The data set must be organised before estimation so that, for each person, 
there are as many data rows as there are duration intervals at risk of the 
event occurring. Given the definitions above, this means 
t_i rows for each person i=1,...,N.  This data organisation is closely 
related to that required for estimation of Cox regression models with 
time-varying covariates. @expand@ is useful for putting the data in this 
form: see [R] expand. See also @stsplit@, or the 'data step' discussion 
in Jenkins (1995).

^i^d^(^idvar^)^ specifies the variable uniquely identifying each 
        person, i.

^s^eq^(^seqvar^)^ is the variable uniquely identifying each time 
        interval at risk for each person. For each i, the variable 
        is the integer sequence 1,2,...,t_i. 

^depvar^ summarizes censoring status during each time
        interval at risk.  If d_i = 0, depvar = 0 for all 
        j = 1,2,...,t_i; if d_i = 1, depvar = 0 for all j = 
        1,2,...,(t_i)-1, and depvar = 1 for j = t_i.
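
A minimal sketch of how this structure might be checked before
estimation, assuming variables named ^id^, ^t^ and ^dead^ as in the
Examples section below:

. ^sort id t^
. ^quietly by id: assert t == _n^
. ^quietly by id: assert dead == 0 if _n < _N^
. ^assert dead == 0 | dead == 1^

Each ^assert^ stops with an error if its condition fails for any row.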
        
Options
-------
^cpr^0^(^#^)^ specifies the value for ln[c/(1-c)] which is used as the 
        starting value in the maximization. The default is -4, i.e. a 
        cure probability of about 0.018.
        
^ef^orm reports the coefficients transformed to hazard ratio format,
        i.e. exp(b) rather than b. Standard errors and confidence 
        intervals are similarly transformed.  ^eform^ may be
        specified at estimation or when redisplaying results.

^nocons^ specifies no intercept term in the function b'X_it.

^le^vel^(^#^)^ specifies the confidence level, in percent, for
        confidence intervals of the parameters; see help @level@.

^clog^log  specifies reporting of the estimates of the cloglog survival 
        time model (i.e. the case assuming c=0; used to derive
        starting values).

^mlopts^ specifies other standard ^maximize^ options. E.g. options such
        as ^trace^, ^gradient^, etc., might be used to investigate 
        convergence problems, and ^ltol^, ^tol^, and ^gtol^, might be 
        used to change convergence tolerances in conjunction with these
        checks.
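
To illustrate the scale used by ^cpr0()^, the following sketch converts
between the cure probability and its logit. The first line returns the
cure probability implied by the default starting value of -4; the second
returns the starting value corresponding to a (hypothetical) cure
probability of 0.10, about -2.2, which is then used in the third. The
variables are assumed to be those created in the Examples section below.

. ^display 1/(1 + exp(4))^
. ^display ln(0.10/0.90)^
. ^spsurv dead logt drug age, id(id) seq(t) cpr0(-2.2)^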

Warning: given the ordered sequence person-interval structure of the
data, the ^if^ and ^in^ qualifiers should be used only with great care:
excluding some of a person's rows can break the sequence 1,2,...,t_i.


Saved results
-------------

In addition to the standard estimates saved in e(.) by ^ml^, ^spsurv^
also saves:

e(ll_noc)       log-likelihood value from the model with c=0 
                (the cloglog model cited above)

e(b0)           estimates of coefficients from model with c=0
                (the cloglog model cited above)

e(V0)           variance-covariance matrix of coefficient 
                estimates from model with c=0
                (the cloglog model cited above)

e(cpr0)         value of logit(c) used as starting value in estimation

e(curep)        the estimate of c

e(securep)      standard error for the estimate of c

e(chi2_c)       chi bar-squared test statistic from LR test of H0: c=0 
                versus H1: c>0. This is a 50:50 mixture of chi-sq(0) and 
                chi-sq(1): see Gutierrez, Carter and Drukker (2001).
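
For example, after estimation the saved results might be inspected as
follows (a sketch; the 1.96 multiplier assumes a normal-approximation
95% interval, which may be unreliable when c is close to zero):

. ^display e(curep)^
. ^display e(curep) - 1.96*e(securep)^
. ^display e(curep) + 1.96*e(securep)^
. ^matrix list e(b0)^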


Examples
--------

. ^use cancer^
. ^ge id = _n  /* unique person identifier */^
. ^expand studytim  /* convert to person-month form */^
. ^sort id^
. ^* create depvar and a duration dependence variable^
. ^quietly by id: ge dead = died & _n==_N^
. ^quietly by id: ge t = _n^
. ^ge logt = log(t)^
. ^stset t dead, id(id)  /* NB relationship to st data format */^
. ^* drug = 1 (placebo); drug = 2,3 (receives drug). So recode:^
. ^recode drug 1=0 2/3=1^
. ^lab var drug "1=receives drug, 0=placebo"^
. ^spsurv dead logt drug age, id(id) seq(t)^
. ^spsurv, eform^
. ^spsurv dead logt drug age, id(id) seq(t) trace cpr0(-10)^


Author
------
Stephen P. Jenkins <stephenj@@essex.ac.uk>
Institute for Social and Economic Research 
University of Essex, Colchester CO4 3SQ, U.K.

Advice from Stata Technical Support is gratefully acknowledged.


References
----------

Gutierrez, R.G., Carter, S., and Drukker, D. (2001), "On 
        boundary-value likelihood-ratio tests", insert sg160, 
        Stata Technical Bulletin, STB-60, StataCorp, College
        Station TX.

Jenkins, S.P. (1995), "Easy estimation methods for discrete-time
        duration models", Oxford Bulletin of Economics and Statistics
        57: 129-138.

Maller, R.A. and Zhou, X. (1996), Survival Analysis with Long Term
        Survivors, Wiley series in probability and statistics, John Wiley,
        Chichester.

Schmidt, P. and Witte, A. (1989), "Predicting criminal recidivism 
        using 'split population' survival time models",
        Journal of Econometrics 40: 141-159.

Also see
--------
@cox@, @st stcox@, @st streg@, @expand@, @pgmhaz@ (if installed),
@lncure@ (if installed)