help boxtid                                                    Patrick Royston
-------------------------------------------------------------------------------

Title

boxtid -- Box-Tidwell and exponential regression models

Syntax

boxtid regression_cmd yvar xvarlist [weight] [if exp] [in range] [, center(cen_list) df(df_list) dfdefault(#) expon(varlist) init(init_list) iter(#) ltolerance(#) trace zero(varlist) regression_cmd_options]

where regression_cmd may be clogit, glm, logistic, logit, poisson, probit, regress, stcox, or streg.

boxtid shares the features of all estimation commands; see help estcom.

All weight types supported by regression_cmd are allowed; see help weights. Also, factor variables are permitted in xvarlist.

Note that xfracplot and xfracpred may be used after boxtid to plot and predict fitted values, respectively. The syntax for xfracplot and xfracpred is the same as for fracplot and fracpred; see help on fracpoly.

Description

boxtid is a generalization of fracpoly in which continuous rather than fractional powers of the continuous covariates are estimated. boxtid fits Box & Tidwell's (1962) power transformation model to yvar with predictors in xvarlist. The model function for each xvar in xvarlist is

    b1 * xvar^p1 + b2 * xvar^p2 + ...

boxtid also fits exponential models for predictors specified in expon(). The model function for each such xvar in xvarlist is

    b1 * exp(p1 * xvar) + b2 * exp(p2 * xvar) + ...

The quantities p1, p2, ... are real numbers.

After execution, boxtid leaves variables in the data named Ixv__1, Ixv__2, ..., where xv represents the first four letters of the name of xvar, the first member of xvarlist. The new variables contain the best-fitting powers of xvar (as centered and scaled by boxtid). Also left are variables named Ixv_p1, Ixv_p2, ..., which are auxiliary variables (see Remarks). Subsequent members of xvarlist, if any, also leave behind such variables.
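To make the two model functions above concrete, here is a minimal Python sketch that evaluates them for given coefficients. It is an illustration only, not part of boxtid: the function names box_tidwell and multi_exponential are hypothetical, and centering and scaling are ignored.

```python
import math

def box_tidwell(x, betas, powers):
    """Evaluate b1 * x^p1 + b2 * x^p2 + ... (hypothetical helper).

    x must be positive so that real-valued powers are defined.
    """
    if x <= 0:
        raise ValueError("x must be positive for real-valued powers")
    return sum(b * x ** p for b, p in zip(betas, powers))

def multi_exponential(x, betas, rates):
    """Evaluate b1 * exp(p1 * x) + b2 * exp(p2 * x) + ... (hypothetical helper)."""
    return sum(b * math.exp(p * x) for b, p in zip(betas, rates))
```

A second-degree Box-Tidwell function has two (beta, power) pairs, which is why it uses 4 df.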

Options

center(cen_list) defines the centering for the covariates xvar1, xvar2, .... The default is center(mean), except for binary covariates where it is center(#), # being the lower of the two distinct values of the covariate. cen_list is a comma-separated list with elements varlist:{mean|#|no}, except that the first element may optionally be of the form {mean|#|no} to specify the default for all variables. For example, center(no, age:mean) sets the default centering to no and that for age to mean.
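The default centering rule described for center() can be sketched in Python as follows. This is only an illustration of the rule as stated; the function name default_center is hypothetical, and missing values and weights are not handled.

```python
def default_center(values):
    """Return the default centering constant for a covariate (hypothetical helper).

    Binary covariates are centered on the lower of their two distinct
    values; all other covariates are centered on their mean.
    """
    distinct = sorted(set(values))
    if len(distinct) == 2:
        return distinct[0]
    return sum(values) / len(values)
```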

df(df_list) sets up the degrees of freedom (df) for each predictor. The df (not counting the regression constant, _cons) are twice the degree of the Box-Tidwell function, defining a model with m terms to have degree m. For example, an xvar fitted as a second-degree Box-Tidwell function has 4 df. The first item in df_list may be either # or varlist:#. Subsequent items must be varlist:#. Items are separated by commas and varlist is specified in the usual way for variables. With the first type of item, the df for all predictors are taken to be #. With the second type of item, all members of varlist (which must be a subset of xvarlist) have # df.

The default degrees of freedom for a predictor specified in xvarlist but not in df_list are assigned according to the number of distinct (unique) values of the predictor, as follows:

    -------------------------------------------
    # of distinct values    default df
    -------------------------------------------
             1              (invalid predictor)
            2-3             1
            4-5             min(2, dfdefault())
            >=6             dfdefault()
    -------------------------------------------

Example: df(4)
    All variables have 4 df.

Example: df(2, weight displ:4)
    weight and displ have 4 df, all other variables have 2 df.

Example: df(weight displ:4, mpg:2)
    weight and displ have 4 df, mpg has 2 df, all other variables have the default df given by the table above.
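The default-df rule above can be written as a small Python function. This is a sketch of the rule as tabulated here, not code taken from boxtid; the name default_df is hypothetical and the dfdefault argument stands in for the dfdefault() option.

```python
def default_df(n_distinct, dfdefault=2):
    """Default df for a predictor with n_distinct unique values (hypothetical helper)."""
    if n_distinct < 2:
        # A predictor with a single distinct value is invalid.
        raise ValueError("invalid predictor: fewer than 2 distinct values")
    if n_distinct <= 3:
        return 1
    if n_distinct <= 5:
        return min(2, dfdefault)
    return dfdefault
```

With the default dfdefault(2), a predictor with 6 or more distinct values gets 2 df, that is, one power term and one beta.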

dfdefault(#) determines the default maximum degrees of freedom (df) for a predictor. Default # is 2 (one power term, one beta).

iter(#) sets # to be the maximum number of iterations allowed for the fitting algorithm to converge. Default # is 100.

expon(varlist) specifies that all members of varlist are to be modelled using an exponential function, the default being a power (Box-Tidwell) model. For each xvar (a member of varlist), a multi-exponential model is fitted, namely

    b1 * exp(p1 * xvar) + b2 * exp(p2 * xvar) + ...

init(init_list) sets initial values for the parameters p1, p2, ... of the model. By default these are calculated automatically. The first item in init_list may be either # [# ...] or varlist:# [# ...]. Subsequent items must be varlist:# [# ...]. Items are separated by commas and varlist is specified in the usual way for variables. If the first item is # [# ...], this becomes the default initial value for all variables, but subsequent items (re)set the initial value for variables in subsequent varlists. If the df for a variable in the model is d (greater than 1), then # [# ...] consists of d/2 items. Typically d = 2, so that there is just one initial value, #.

ltolerance(#) is the maximum difference in deviance between iterations required for convergence of the fitting algorithm. Default # is 0.001.

powers(powerlist) defines the powers to be used with fractional polynomial initialization for xvarlist (see Remarks).

trace reports the progress of the fitting procedure towards convergence.

zero(varlist) indicates transformation of negative and zero values of all members of varlist to zero before fitting the model (see Remarks).

regression_cmd_options are any of the options available with regression_cmd.

Remarks

boxtid finds and reports a multiple regression model comprising the maximum likelihood estimate of p1, p2, ... for each member of xvarlist. The model that is fit depends on the type of regression_cmd that is used.

The fitting procedure is iterative and requires accurate starting values for the powers p1, p2, .... boxtid finds initial values for the p's by fitting a fractional polynomial of the appropriate degree for each xvar in turn, with the remaining xvars treated as linear. This procedure greatly reduces the amount of iteration needed subsequently to obtain maximum likelihood estimates of the p's.

The table of output includes for each member of xvarlist a test of whether the relation is linear. That is, it reports a quantity called Nonlin. dev., the difference in deviance between the continuous-power model for an xvar and a model linear in xvar, adjusting for other variables in the model. A P-value from a chi-square or F test of the hypothesis of linearity, and the estimated linear coefficient for the xvar, are given.

Appropriate estimates of the standard errors of p1, p2, ... are provided in the table of output, and the standard errors of the corresponding regression coefficients are correctly estimated. This requires the auxiliary variables ln(xvar) * xvar^p1, ln(xvar) * xvar^p2, ... to be included in the model. The estimated t- or z-values for the coefficients of these terms should be zero to at least 3 decimal places. If they are not zero, then the estimation procedure probably has not converged properly; the value of # in ltolerance() should be reduced below its default value of 0.001, and the model re-fitted.

If an xvar has any negative or zero values and neither the expon() nor the zero() option is used, boxtid behaves exactly like fracpoly in that it subtracts the minimum of xvar from xvar and adds the rounding (or counting) interval. The interval is defined as the smallest positive difference between the ordered values of xvar. After this change of origin, the minimum value of xvar is guaranteed positive.

An example of the zero() option is in the assessment of the effect of cigarette smoking on the risk of a disease in an epidemiological study. Since non-smokers may be qualitatively different from smokers, the effect of quantity smoked, regarded as a continuous risk factor, may be discontinuous at zero. The risk may be modelled as a constant for the non-smokers and a Box-Tidwell function of the amount smoked for the smokers by including the zero() option and a dummy variable for non-smokers, for example

    . gen byte nonsmoker = (num_cigs==0) if ~missing(num_cigs)
    . boxtid logit death num_cigs nonsmoker, zero(num_cigs)

Omission of zero(num_cigs) would cause num_cigs to be transformed before analysis by the addition of a suitable constant, probably 1.

Convergence of the algorithm is not guaranteed and may be hard to achieve for models with xvars with 4 or more degrees of freedom. Sometimes a large negative or positive power estimate with an enormous standard error is obtained, a sign that the model may be overparametrized. It is worth trying a lower-degree model and checking whether the higher-degree model reduces the deviance significantly (chi-square or F test on 2 df).
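The change of origin described above for an xvar with negative or zero values can be sketched in Python as follows. This only illustrates the stated rule (subtract the minimum, add the smallest positive difference between ordered values); the function name shift_origin is hypothetical and is not how boxtid or fracpoly is implemented.

```python
def shift_origin(values):
    """Shift values to be strictly positive (hypothetical helper).

    If every value is already positive, the values are returned unchanged.
    Otherwise the minimum is subtracted and the rounding interval (the
    smallest positive difference between ordered distinct values) is added.
    """
    xmin = min(values)
    if xmin > 0:
        return list(values)
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values to find the interval")
    interval = min(b - a for a, b in zip(distinct, distinct[1:]))
    return [v - xmin + interval for v in values]
```

For a count such as num_cigs recorded in whole cigarettes with a minimum of 0, the interval is typically 1, so the shift amounts to adding 1.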

Examples

    . sysuse auto.dta
    . boxtid regress mpg weight
    . boxtid regress mpg weight displ foreign
    . boxtid regress mpg weight displ foreign, df(weight displ:2, foreign:1)
    . boxtid regress mpg displ weight, expon(weight)
    . boxtid logit foreign mpg, center(no)
    . boxtid glm foreign mpg, family(bin)
    . xfracplot mpg

Reference

Box GEP, Tidwell PW. 1962. Transformation of the independent variables. Technometrics 4:531-550.

Author

Patrick Royston, MRC Clinical Trials Unit, London. patrick.royston@ctu.mrc.ac.uk
