help for ^spsurv^

Split population survival ('cure') model (discrete time duration data)
----------------------------------------------------------------------

^spsurv^ depvar varlist [^if^ <exp>] [^in^ <range>] , ^i^d^(^idvar^)^ ^s^eq^(^seqvar^)^ [^nocons^] [^clog^log ^cpr^0^(^#^)^ ^ef^orm ^le^vel^(^#^)^ ^mlopts^]

To reset problem-size limits, see help @matsize@. ^spsurv^ works with Stata version 6 or version 7.

Description
-----------

^spsurv^ estimates what economists refer to as split population survival models (Schmidt and Witte, 1989) and biostatisticians refer to as cure models (Maller and Zhou, 1996), for the case where survival times are intrinsically discrete or are recorded in grouped form, e.g. in months. (Cf. the continuous time lognormal cure model ^lncure^.) Standard survival models assume that the prob(eventual failure) > 0 for all individuals. By contrast split population models suppose that a proportion never fail. ^spsurv^ estimates by ML this proportion together with the parameters characterising the hazard rate for the remainder of the population. For the latter, ^spsurv^ estimates a discrete-time proportional hazard (cloglog) model. It is a form of 'mover-stayer' model.

Covariates in varlist may include regressors (fixed or time-varying) and variables summarizing the duration dependence of the hazard rate. With suitable definition of the latter, models with a fully non-parametric specification for duration dependence may be estimated; so too may parametric specifications.

Let F be an indicator of whether a case eventually fails or not, where F=1 means eventual failure (not 'cured'), and F=0 means never fail (i.e. the event of interest never occurs = 'cured'). Put another way, let prob(F=1) = 1-c ('recidivist' probability) and prob(F=0) = c ('cure' probability). ^spsurv^ estimates the cure probability, c, rather than 1-c, the recidivist probability.

For those with failure observed during a given time interval, the contribution to the likelihood is (1-c)*(probability of survival to end of previous time interval)*(probability of the event in the given interval). Censored observations consist of those 'cured' plus the non-cured not yet observed to fail. Hence the contribution to the likelihood from a censored survival time is c plus (1-c)*(probability of survival to end of the given time interval).

More precisely, the (log)likelihood contribution for person i with a survival time of t 'months' is:

   lnL_i = d_i.ln[(1-c).h_it.S_i,t-1] + (1-d_i).ln[c + (1-c).S_it]

where the discrete-time survivor function is

S_it = PRODUCT(j=1 to j=t) { 1-h_ij }

and d_i is a censoring indicator (=1 if failure observed, 0 otherwise).
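
As a rough numerical sketch of these two contributions, suppose (purely for illustration) a constant hazard h = 0.1 in every interval, a cure probability c = 0.2, and t = 3. Then survival to the end of interval 2 is 0.9*0.9 = 0.81 and survival to the end of interval 3 is 0.729, so an observed failure contributes ln(0.8*0.1*0.81) = ln(0.0648), and a censored spell contributes ln(0.2 + 0.8*0.729) = ln(0.7832). The values of h, c and t are assumptions chosen for the arithmetic, not estimates from any data set:

   . ^di ln(.8*.1*.9*.9)^
   . ^di ln(.2 + .8*.9*.9*.9)^

These are roughly -2.74 and -0.24 respectively.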

The discrete-time hazard h_it is assumed to take the cloglog form:

h_it = 1 - exp[-exp(I_it)] where I_it = f(t) + b'X_it.

Covariates X_it may be time-varying. The f(t) summarises duration dependence in the hazard common to each i. An example is f(t) = log(t) for a discrete time analogue to a continuous-time Weibull model. A non-parametric baseline hazard could be fit using an appropriately-defined set of dummy variables, one for each interval at risk of failure. (This is perhaps the only time that one might want to use the ^nocons^ option.)
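
A minimal sketch of the non-parametric specification, assuming the person-month data set built in the Examples section below (with interval counter t) and assuming no existing variable name begins with the hypothetical stub dur:

   . ^quietly tab t, gen(dur)^
   . ^spsurv dead dur* drug age, id(id) seq(t) nocons^

Here ^tab^ with ^gen()^ creates one dummy per interval, and ^nocons^ avoids collinearity between the full set of dummies and the intercept. Intervals in which no failure is observed have no finite estimate for their dummy, so in practice such intervals would probably need to be grouped with a neighbouring interval before estimation.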

If c = 0, a testable hypothesis, the split population survival model reduces to the standard discrete-time proportional hazards survival model. This could be estimated using the command ^cloglog^ dead varlist [^if^ <exp>] [^in^ <range>] , ^options^ applied to data organised exactly as they are for the corresponding ^spsurv^ command. This model is used to derive starting values in ^spsurv^, and its estimates may be displayed using the ^cloglog^ option.
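
For instance, assuming the variables created in the Examples section below, the c=0 model and the split population model could be compared along these lines:

   . ^cloglog dead logt drug age^
   . ^spsurv dead logt drug age, id(id) seq(t) cloglog^

The second command displays the cloglog estimates used as starting values as well as the split population estimates.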

The likelihood ratio test of whether c=0 is implemented as a boundary-value test, as described by Gutierrez, Carter and Drukker (2001). (See also Maller and Zhou, 1996.) Where c is so small as to be indistinguishable from zero (taken here to mean c < 1e-05), the test statistic is set equal to zero and a p-value of 1 is reported.

In principle, prob(cure) could differ between individuals, rather than being assumed fixed and common to all individuals as here. One obvious parameterisation would be to suppose a logistic relationship between characteristics and the cure probability, i.e. c_i = 1/[1 + exp(-_cons - q'X_i)], rather than assuming q=0, as here. In practice, models allowing for such heterogeneity are difficult to fit.

Important note about data organization and mandatory variables
--------------------------------------------------------------

The data set must be organised before estimation so that, for each person, there are as many data rows as there are duration intervals at risk of the event occurring. Given the definitions above, this means t_i rows for each person i=1,...,N. This data organisation is closely related to that required for estimation of Cox regression models with time-varying covariates. @expand@ is useful for putting the data in this form: see [R] expand. See also @stsplit@, or the 'data step' discussion in Jenkins (1995).
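
As an illustrative sketch of this layout, consider two hypothetical persons: person 1 fails in the third interval (t_1 = 3, d_1 = 1) and person 2 is censored after two intervals (t_2 = 2, d_2 = 0). Using the idvar, seqvar and depvar variables described below, the organised data would contain five rows:

      idvar   seqvar   depvar
          1        1        0
          1        2        0
          1        3        1
          2        1        0
          2        2        0

Any covariates would appear as further columns, constant across a person's rows if fixed and changing from row to row if time-varying.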

^i^d^(^idvar^)^ specifies the variable uniquely identifying each person, i.

^s^eq^(^seqvar^)^ is the variable uniquely identifying each time interval at risk for each person. For each i, the variable is the integer sequence 1,2,...,t_i.

^depvar^ summarizes censoring status during each time interval at risk. If d_i = 0, depvar = 0 for all j = 1,2,...,t_i; if d_i = 1, depvar = 0 for all j = 1,2,...,(t_i)-1, and depvar = 1 for j = t_i.

Options
-------

^cpr^0^(^#^)^ specifies the value for ln[c/(1-c)] which is used as the starting value in the maximization. The default is -4, i.e. a cure probability of about 0.018.

^ef^orm reports the coefficients transformed to hazard ratio format, i.e. exp(b) rather than b. Standard errors and confidence intervals are similarly transformed. ^eform^ may be specified at estimation or when redisplaying results.

^nocons^ specifies no intercept term in the function b'X_it.

^le^vel^(^#^)^ specifies the confidence level, in percent, for confidence intervals of the parameters; see help @level@.

^clog^log specifies reporting of the estimates of the cloglog survival time model (i.e. the case assuming c=0; used to derive starting values).

^mlopts^ specifies other standard ^maximize^ options. E.g. options such as ^trace^ and ^gradient^ might be used to investigate convergence problems, and ^ltol^, ^tol^, and ^gtol^ might be used to change convergence tolerances in conjunction with these checks.

Warning: given the ordered sequence person-interval structure of the data, the ^if^ and ^in^ qualifiers should be used only with great care.
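
As a small sketch of how the ^cpr0()^ scale relates to the cure probability: the first line below converts the default starting value of -4 back to a probability, the second computes the logit corresponding to an assumed cure probability of 0.05 (about -2.94), and the third starts estimation from the nearby value -3, i.e. c of about 0.047. The value 0.05 is a hypothetical choice for illustration, and the variables are those of the Examples section below:

   . ^di 1/(1 + exp(4))^
   . ^di ln(.05/(1 - .05))^
   . ^spsurv dead logt drug age, id(id) seq(t) cpr0(-3)^

Trying a few different starting values in this way may help to check whether the maximization converges to the same estimates.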

Saved results
-------------

In addition to the standard estimates saved in e(.) by ^ml^, ^spsurv^ also saves:

e(ll_noc) log-likelihood value from the model with c=0 (the cloglog model cited above)

e(b0) estimates of coefficients from model with c=0 (the cloglog model cited above)

e(V0) variance-covariance matrix of coefficient estimates from model with c=0 (the cloglog model cited above)

e(cpr0) value of logit(c) used as starting value in estimation

e(curep) the estimate of c

e(securep) standard error for the estimate of c

e(chi2_c) chi bar-squared test statistic from LR test of H0: c=0 versus H1: c>0. This is a 50:50 mixture of chi-sq(0) and chi-sq(1): see Gutierrez, Carter and Drukker (2001).
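
A hedged sketch of how the boundary-value test might be reproduced by hand from the saved results, assuming the statistic is twice the difference between e(ll) and e(ll_noc) and that it is strictly positive (when it is set to zero the p-value is 1). With a 50:50 mixture of chi-sq(0) and chi-sq(1), the p-value is half the upper-tail chi-sq(1) probability:

   . ^spsurv dead logt drug age, id(id) seq(t)^
   . ^di 2*(e(ll) - e(ll_noc))^
   . ^di 0.5*chiprob(1, e(chi2_c))^

^chiprob()^ is the Stata 6 name of the upper-tail chi-squared function; under Stata 7 the equivalent function is ^chi2tail()^.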

Examples
--------

   . ^use cancer^
   . ^ge id = _n                /* unique person identifier */^
   . ^expand studytim           /* convert to person-month form */^
   . ^sort id^
   . ^* create depvar and a duration dependence variable^
   . ^quietly by id: ge dead = died & _n==_N^
   . ^quietly by id: ge t = _n^
   . ^stset t dead, id(id)      /* NB relationship to st data format */^
   . ^ge logt = log(t)^
   . ^* drug = 1 (placebo); drug = 2,3 (receives drug). So recode:^
   . ^recode drug 1=0 2/3=1^
   . ^lab var drug "1=receives,0=placebo"^
   . ^spsurv dead logt drug age, id(id) seq(t)^
   . ^spsurv, eform^
   . ^spsurv dead logt drug age, id(id) seq(t) trace cpr0(-10)^

Author
------

Stephen P. Jenkins <stephenj@@essex.ac.uk>
Institute for Social and Economic Research
University of Essex, Colchester CO4 3SQ, U.K.

Advice from Stata Technical Support is gratefully acknowledged.

References
----------

Gutierrez, R.G., Carter, S., and Drukker, D. (2001). "On boundary-value likelihood-ratio tests", insert sg160, Stata Technical Bulletin, STB-60, StataCorp, College Station TX.

Jenkins, S.P. (1995), "Easy estimation methods for discrete-time duration models", Oxford Bulletin of Economics and Statistics 57: 129-138.

Maller, R.A. and Zhou, X. (1996), Survival Analysis with Long Term Survivors, Wiley series in probability and statistics, John Wiley, Chichester.

Schmidt, P. and Witte, A. (1989), "Predicting criminal recidivism using 'split population' survival time models", Journal of Econometrics 40: 141-159.

Also see
--------

@cox@, @st stcox@, @st streg@, @expand@, @pgmhaz@ (if installed), @lncure@ (if installed)