```help boxtid                                                     Patrick Royston
-------------------------------------------------------------------------------

Title

boxtid --  Box-Tidwell and exponential regression models

Syntax

boxtid regression_cmd yvar xvarlist [weight] [if exp] [in range] [,
center(cen_list) df(df_list) dfdefault(#) expon(varlist)
init(init_list) iter(#) ltolerance(#) trace zero(varlist)
regression_cmd_options ]

where regression_cmd may be clogit, glm, logistic, logit, poisson,
probit, regress, stcox, or streg.

boxtid shares the features of all estimation commands; see help estcom.

All weight types supported by regression_cmd are allowed; see help
weights. Also, factor variables are permitted in xvarlist.

Note that xfracplot and xfracpred may be used after boxtid to plot and
predict fitted values, respectively.  The syntax for xfracplot and
xfracpred is the same as for fracplot and fracpred; see help fracpoly.

Description

boxtid is a generalization of fracpoly in which continuous rather than
fractional powers of the continuous covariates are estimated. boxtid fits
Box & Tidwell's (1962) power transformation model to yvar with predictors
in xvarlist. The model function for each xvar in xvarlist is

b1 * xvar^p1 + b2 * xvar^p2 + ...

boxtid also fits exponential models for predictors specified in expon().
The model function for each such xvar in xvarlist is

b1 * exp(p1 * xvar) + b2 * exp(p2 * xvar) + ...

The quantities p1, p2, ... are real numbers. After execution, boxtid
leaves variables in the data named Ixv__1, Ixv__2, ..., where xv
represents the first four letters of the name of xvar, the first member
of xvarlist. The new variables contain the best-fitting powers of xvar
(as centered and scaled by boxtid). Also left are variables named Ixv_p1,
Ixv_p2, ... which are auxiliary variables (see Remarks). Subsequent
members of xvarlist, if any, also leave behind such variables.

Options

center(cen_list) defines the centering for the covariates xvar1, xvar2,
....  The default is center(mean), except for binary covariates where
it is center(#), # being the lower of the two distinct values of the
covariate. cen_list is a comma-separated list with elements
varlist:{mean|#|no}, except that the first element may optionally be
of the form {mean|#|no} to specify the default for all variables. For
example, center(no, age:mean) sets the default centering to no and
that for age to mean.

df(df_list) sets up the degrees of freedom (df) for each predictor. The
df (not counting the regression constant, _cons) are twice the degree
of the Box-Tidwell function, defining a model with m terms to have
degree m.  For example, an xvar fitted as a second-degree Box-Tidwell
function has 4 df.  The first item in df_list may be either # or
varlist:#.  Subsequent items must be varlist:#.  Items are separated
by commas and varlist is specified in the usual way for variables.
With the first type of item, the df for all predictors are taken to
be #.  With the second type of item, all members of varlist (which
must be a subset of xvarlist) have # df.

The default degrees of freedom for a predictor that appears in
xvarlist but not in df_list are assigned according to the number of
distinct (unique) values of the predictor, as follows:

-------------------------------------------
# of distinct values    default df
-------------------------------------------
        1               (invalid predictor)
       2-3              1
       4-5              min(2, dfdefault())
       >=6              dfdefault()
-------------------------------------------

Example:  df(4)
All variables have 4 df.

Example:  df(2, weight displ:4)
weight and displ have 4 df, all other variables have 2 df.

Example:  df(weight displ:4, mpg:2)
weight and displ have 4 df, mpg has 2 df, all other variables have
the default df given by the table above.

dfdefault(#) determines the default maximum degrees of freedom (df) for a
predictor. Default # is 2 (one power term, one beta).

iter(#) sets # to be the maximum number of iterations allowed for the
fitting algorithm to converge. Default: 100.

expon(varlist) specifies that all members of varlist are to be modelled
using an exponential function, the default being a power
(Box-Tidwell) model. For each xvar (a member of varlist), a
multi-exponential model is fitted, namely

b1 * exp(p1 * xvar) + b2 * exp(p2 * xvar) + ...

init(init_list) sets initial values for the parameters p1, p2, ... of the
model. By default these are calculated automatically.  The first item
in init_list may be either # [# ...] or varlist:# [# ...]. Subsequent
items must be varlist:# [# ...]. Items are separated by commas and
varlist is specified in the usual way for variables. If the first
item is #[# ...], this becomes the default initial value for all
variables, but subsequent items (re)set the initial value for
variables in subsequent varlists. If the df for a variable in the
model is d (greater than 1) then # # ... consists of d/2 items.
Typically d = 2 so that there is just one initial value, #.

ltolerance(#) is the maximum difference in deviance between iterations
required for convergence of the fitting algorithm.  Default #: 0.001.

powers(powerlist) defines the powers to be used with fractional
polynomial initialization for xvarlist (see Remarks).

trace reports the progress of the fitting procedure towards convergence.

zero(varlist) indicates transformation of negative and zero values of all
members of varlist to zero before fitting the model (see Remarks).

regression_cmd_options are any of the options available with
regression_cmd.

Remarks

boxtid finds and reports a multiple regression model comprising the
maximum likelihood estimate of p1, p2, ... for each member of xvarlist.
The model that is fit depends on the type of regression_cmd that is used.

The fitting procedure is iterative and requires accurate starting values
for the powers p1, p2, .... Unless init() is specified, boxtid finds
initial values for the p's by fitting a fractional polynomial of the
appropriate degree for each xvar in turn, with the remaining xvars
treated as linear. This procedure greatly reduces the amount of iteration
needed subsequently to obtain maximum likelihood estimates of the p's.

The table of output includes for each member of xvarlist a test of
whether the relation is linear. That is, it reports a quantity called
Nonlin. dev., the difference in deviance between the continuous-power
model for an xvar and a model linear in xvar, adjusting for other
variables in the model. A P-value from a chi-square or F test of the
hypothesis of linearity, and the estimated linear coefficient for the
xvar, are given.

Appropriate estimates of the standard errors of p1, p2, ... are provided
in the table of output, and the standard errors of the corresponding
regression coefficients are correctly estimated. This requires the
auxiliary variables ln(xvar) * xvar^p1, ln(xvar) * xvar^p2, ... to be
included in the model.  The estimated t- or z-values for the coefficients
of these terms should be zero to at least 3 decimal places. If they are
not zero, then the estimation procedure probably has not converged
properly; the value of # in ltolerance() should be reduced below its
default value of 0.001, and the model re-fitted.

If an xvar has any negative or zero values and neither the expon() nor
the zero() option is used, boxtid behaves exactly like fracpoly in that
it subtracts the minimum of xvar from xvar and adds the rounding (or
counting) interval. The interval is defined as the smallest positive
difference between the ordered values of xvar. After this change of
origin, the minimum value of xvar is guaranteed positive.

An example of the zero() option is in the assessment of the effect of
cigarette smoking on the risk of a disease in an epidemiological study.
Since non-smokers may be qualitatively different from smokers, the effect
of quantity smoked, regarded as a continuous risk factor, may be
discontinuous at zero. The risk may be modelled as a constant for the
non-smokers and a Box-Tidwell function of the amount smoked for the
smokers by including the zero() option and a dummy variable for
non-smokers, for example

. gen byte nonsmoker = (num_cigs==0) if ~missing(num_cigs)
. boxtid logit death num_cigs nonsmoker, zero(num_cigs)

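Omission of zero(num_cigs) would cause num_cigs to be transformed before
analysis by the addition of a constant, as described above. Since the
minimum of num_cigs is zero, the constant is the rounding interval, which
is 1 if num_cigs is recorded in whole cigarettes.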

Convergence of the algorithm is not guaranteed and may be hard to achieve
for models with xvars with 4 or more degrees of freedom. Sometimes a
large negative or positive power estimate with an enormous standard error
is obtained, a sign that the model may be overparametrized. It is worth
trying a lower-degree model and noting whether the deviance increases
significantly (chi-square or F test on 2 df).

Examples

. sysuse auto.dta
. boxtid regress mpg weight
. boxtid regress mpg weight displ foreign
. boxtid regress mpg weight displ foreign, df(weight displ:2, foreign:1)
. boxtid regress mpg displ weight, expon(weight)
. boxtid logit foreign mpg, center(no)
. boxtid glm foreign mpg, family(bin)
. xfracplot mpg

Reference

Box GEP, Tidwell PW. 1962. Transformation of the independent variables.
Technometrics 4:531-550.

Author

Patrick Royston, MRC Clinical Trials Unit, London.
patrick.royston@ctu.mrc.ac.uk

```
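
Two rules in the help text above are purely arithmetic and may be easier to
follow as code: the default df table under df(), and the change of origin
applied to an xvar with negative or zero values when neither expon() nor
zero() is used. The following is a minimal Python sketch of those two rules
as the help text describes them. It is not boxtid's own code, and the
function names default_df and shift_to_positive are hypothetical, chosen
only for this illustration.

```python
def default_df(n_distinct, dfdefault=2):
    """Default df for a predictor, following the table under df().

    n_distinct is the number of distinct values of the predictor;
    dfdefault mirrors the dfdefault(#) option, whose default is 2.
    """
    if n_distinct < 2:
        raise ValueError("a predictor with one distinct value is invalid")
    if n_distinct <= 3:
        return 1
    if n_distinct <= 5:
        return min(2, dfdefault)
    return dfdefault


def shift_to_positive(values):
    """Change of origin described in Remarks.

    If any value is negative or zero, subtract the minimum and add the
    rounding interval, defined as the smallest positive difference
    between the ordered values. Otherwise return the values unchanged.
    """
    if min(values) > 0:
        return list(values)
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    interval = min(b - a for a, b in zip(distinct, distinct[1:]))
    lowest = distinct[0]
    return [v - lowest + interval for v in values]


if __name__ == "__main__":
    print(default_df(4))                     # 4-5 distinct values: min(2, 2)
    print(shift_to_positive([0, 1, 2, 20]))  # minimum 0, interval 1
```

With whole-number counts that start at zero, as in the num_cigs example, the
sketch adds 1 to every value, which matches the remark that omitting
zero(num_cigs) shifts the variable by a constant.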