```
.-
help for ^spsurv^
.-

Split population survival ('cure') model (discrete time duration data)
----------------------------------------------------------------------

^spsurv^ depvar varlist [^if^ <exp>] [^in^ <range>]
, ^i^d^(^idvar^)^ ^s^eq^(^seqvar^)^  [^nocons^]
[^clog^log ^cpr^0^(^#^)^ ^ef^orm ^le^vel^(^#^)^ ^mlopts^]

To reset problem-size limits, see help @matsize@.
^spsurv^ works with Stata version 6 or version 7.

Description
-----------

^spsurv^ estimates what economists refer to as split population survival
models (Schmidt and Witte, 1989) and biostatisticians refer to as cure
models (Maller and Zhou, 1996), for the case where survival times are
intrinsically discrete or are recorded in grouped form, e.g. in months.
(Cf. the continuous time lognormal cure model ^lncure^.) Standard
survival models assume that prob(eventual failure) = 1 for all
individuals. By contrast, split population models suppose that a
proportion never fail. ^spsurv^ estimates by ML this proportion together
with the parameters characterising the hazard rate for the remainder of
the population. For the latter, ^spsurv^ estimates a discrete-time
proportional hazard (cloglog) model. It is a form of 'mover-stayer'
model.

Covariates in varlist may include regressors (fixed or time-varying)
and variables summarizing the duration dependence of the hazard rate.
With suitable definition of the latter, models with a fully non-parametric
specification for duration dependence may be estimated; so too may
parametric specifications.

Let F be an indicator of whether a case eventually fails or not, where
F=1 means eventual failure (not 'cured'), and F=0 means never fail
(i.e. the event of interest never occurs = 'cured'). Put another way,
let prob(F=1) = 1-c  ('recidivist' probability) and prob(F=0) = c ('cure'
probability). ^spsurv^ estimates the cure probability, c, rather than
1-c, the recidivist probability.

For those with failure observed during a given time interval,
the contribution to the likelihood is (1-c)*(probability of survival
to end of previous time interval)*(probability of the event in the given
interval). Censored observations consist of those 'cured' plus the non-cured
not yet observed to fail. Hence the contribution to the likelihood from a
censored survival time is c plus (1-c)*(probability of survival to end of the
given time interval).

More precisely, the (log)likelihood contribution for person i with a
survival time of t 'months' is:

lnL_i = d_i*ln[(1-c)*h_it*S_i,t-1] + (1-d_i)*ln[c + (1-c)*S_it]

where the discrete-time survivor function is

S_it = PRODUCT(j=1 to j=t) { 1-h_ij },  with S_i0 = 1,

and d_i is a censoring indicator (=1 if failure observed, 0 otherwise).
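
As a purely illustrative check of these expressions (the numbers are
hypothetical and not taken from any data set), suppose c = 0.2 and the
hazard is constant at h_ij = 0.1, so that survival to the end of interval 2
has probability 0.81 and survival to the end of interval 3 has probability
0.729. A person observed to fail in interval 3 then contributes
(1-c)*(0.1)*(0.81) to the likelihood, and a person censored at the end of
interval 3 contributes c + (1-c)*(0.729):

. ^display 0.8*0.1*0.81  /* failure in interval 3: 0.0648 */^
. ^display 0.2 + 0.8*0.729  /* censored at end of interval 3: 0.7832 */^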

The discrete-time hazard h_it is assumed to take the cloglog form:

h_it = 1 - exp[-exp(I_it)]
where
I_it = f(t) + b'X_it.

Covariates X_it may be time-varying. The f(t) summarises duration
dependence in the hazard common to each i. An example is f(t) =
log(t) for a discrete time analogue to a continuous-time Weibull
model. A non-parametric baseline hazard could be fit using an
appropriately-defined set of dummy variables, one for each interval
at risk of failure. (This is perhaps the only time that one might
want to use the ^nocons^ option.)
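
A minimal sketch of the non-parametric specification, assuming person-interval
data with the variables dead, t, drug and age created as in the Examples
section; the prefix tdum is an arbitrary choice. Intervals in which no
failure is observed generally need to be grouped with adjacent intervals
before the dummies are created, as otherwise the corresponding coefficients
are not identified:

. ^tabulate t, generate(tdum)  /* one dummy per interval: tdum1, tdum2, ... */^
. ^spsurv dead tdum* drug age, id(id) seq(t) nocons^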

If c = 0, a testable hypothesis, the split population survival model
reduces to the standard discrete-time proportional hazards survival model.
This could be estimated using the command
^cloglog^ dead varlist [^if^ <exp>] [^in^ <range>] , ^options^
applied to data organised exactly as they are for the corresponding ^spsurv^
command. This model is used to derive starting values in ^spsurv^, and its
estimates may be displayed using the ^cloglog^ option.
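
For instance, with variables named as in the Examples section, the c=0
model might be fit directly, or its estimates requested as part of the
^spsurv^ output:

. ^cloglog dead logt drug age^
. ^spsurv dead logt drug age, id(id) seq(t) cloglog^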

The likelihood ratio test of whether c=0 is implemented as a boundary-value
test, as described by Gutierrez, Carter and Drukker (2001). (See also Maller
and Zhou, 1996.)  Where c is so small as to be indistinguishable from zero
(taken here to mean c < 1e-05), the test statistic is set equal to zero and a
p-value of 1 reported.

In principle, prob(cure) could differ between individuals, rather than being assumed
fixed and common to individuals as here. One obvious parameterisation would be
to suppose a logistic relationship between characteristics and the cure
probability, i.e. c_i = 1/[1 + exp(-_cons - q'X_i) ], rather than assuming
q=0, as here. In practice, models allowing for such heterogeneity are
difficult to fit.

Important note about data organization and mandatory variables
--------------------------------------------------------------
The data set must be organised before estimation so that, for each person,
there are as many data rows as there are duration intervals at risk of the
event occurring. Given the definitions above, this means
t_i rows for each person i=1,...,N.  This data organisation is closely
related to that required for estimation of Cox regression models with
time-varying covariates. @expand@ is useful for putting the data in this
form: see [R] expand. See also @stsplit@, or the 'data step' discussion
in Jenkins (1995).

^i^d^(^idvar^)^ specifies the variable uniquely identifying each
person, i.

^s^eq^(^seqvar^)^ is the variable uniquely identifying each time
interval at risk for each person. For each i, the variable
is the integer sequence 1,2,...,t_i.

^depvar^ summarizes censoring status during each time
interval at risk.  If d_i = 0, depvar = 0 for all
j = 1,2,...,t_i; if d_i = 1, depvar = 0 for all j =
1,2,...,(t_i)-1, and depvar = 1 for j = t_i.

Options
-------
^cpr^0^(^#^)^ specifies the value for ln[c/(1-c)] which is used as the
starting value in the maximization. The default is -4, i.e. a
cure probability of c = 1/[1+exp(4)], approximately 0.018.

^ef^orm reports the coefficients transformed to hazard ratio format,
i.e. exp(b) rather than b. Standard errors and confidence
intervals are similarly transformed.  ^eform^ may be
specified at estimation or when redisplaying results.

^nocons^ specifies no intercept term in the function b'X_it.

^le^vel^(^#^)^ specifies the confidence level, in percent, for
confidence intervals of the parameters; see help @level@.

^clog^log  specifies reporting of the estimates of the cloglog survival
time model (i.e. the case assuming c=0; used to derive
starting values).

^mlopts^ specifies other standard ^maximize^ options. E.g. options such
as ^trace^, ^gradient^, etc., might be used to investigate
convergence problems, and ^ltol^, ^tol^, and ^gtol^ might be
used to change convergence tolerances in conjunction with these
checks.

Warning: given the ordered sequence person-interval structure of the
data, the ^if^ and ^in^ options should be used only with great care.
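
A sketch of how a starting value for ^cpr0()^ might be chosen. Since the
option is on the logit scale, a trial cure probability of, say, 0.10 (a
hypothetical value) corresponds to ln(0.1/0.9), roughly -2.2. Variable
names follow the Examples section:

. ^display ln(0.1/0.9)^
. ^spsurv dead logt drug age, id(id) seq(t) cpr0(-2.2)^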

Saved results
-------------

In addition to the standard estimates saved in e(.) by ^ml^, ^spsurv^
also saves:

e(ll_noc)       log-likelihood value from the model with c=0
(the cloglog model cited above)

e(b0)           estimates of coefficients from model with c=0
(the cloglog model cited above)

e(V0)           variance-covariance matrix of coefficient
estimates from model with c=0
(the cloglog model cited above)

e(cpr0)         value of logit(c) used as starting value in estimation

e(curep)        the estimate of c

e(securep)      standard error for the estimate of c

e(chi2_c)       chi bar-squared test statistic from LR test of H0: c=0
versus H1: c>0. This is a 50:50 mixture of chi-sq(0) and
chi-sq(1): see Gutierrez, Carter and Drukker (2001).
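
After estimation, these results can be inspected in the usual way. The
last line is a sketch of the boundary-value p-value implied by the 50:50
mixture when the test statistic is positive; it assumes a Stata release
that provides the chi2tail() function:

. ^display e(curep)^
. ^display e(securep)^
. ^matrix list e(b0)^
. ^display 0.5*chi2tail(1, e(chi2_c))^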

Examples
--------

. ^use cancer^
. ^ge id = _n  /* unique person identifier */^
. ^expand studytim  /* convert to person-month form */^
. ^sort id^
. ^* create depvar and a duration dependence variable^
. ^quietly by id: ge dead = died & _n==_N^
. ^quietly by id: ge t = _n^
. ^ge logt = log(t)^
. ^stset t, failure(dead) id(id)  /* NB relationship to st data format */^
. ^* drug = 1 (placebo); drug = 2,3 (receives drug). So recode:^
. ^recode drug 1=0 2/3=1^
. ^lab var drug "1=receives,0=placebo"^
. ^spsurv dead logt drug age, id(id) seq(t)^
. ^spsurv, eform^
. ^spsurv dead logt drug age, id(id) seq(t) trace cpr0(-10)^

Author
------
Stephen P. Jenkins <stephenj@@essex.ac.uk>
Institute for Social and Economic Research
University of Essex, Colchester CO4 3SQ, U.K.

Advice from Stata Technical Support is gratefully acknowledged.

References
----------

Gutierrez, R.G., Carter, S., and Drukker, D. (2001), "On
boundary-value likelihood-ratio tests", insert sg160,
Stata Technical Bulletin, STB-60, StataCorp, College
Station TX.

Jenkins, S.P. (1995), "Easy estimation methods for discrete-time
duration models", Oxford Bulletin of Economics and Statistics
57: 129-138.

Maller, R.A. and Zhou, X. (1996), Survival Analysis with Long Term Survivors,
Wiley series in probability and statistics, John Wiley,
Chichester.

Schmidt, P. and Witte, A. (1989), "Predicting criminal recidivism
using 'split population' survival time models",
Journal of Econometrics 40: 141-159.

Also see
--------
@cox@, @st stcox@, @st streg@, @expand@, @pgmhaz@ (if installed),
@lncure@ (if installed)

```