Homoskedastic adjustment inflation factors for model selection
haif [ corevarlist ] [if] [in] [weight] , addvars(varlist) [ noconstant ]
haifcomp [ corevarlist ] [if] [in] [weight] , daddvars(varlist) naddvars(varlist) [ noconstant ]
where corevarlist is a varlist (possibly empty).
aweights, fweights, and iweights are allowed; see help for weight.
corevarlist may contain factor variables; see fvvarlist.
Description
haif calculates homoskedastic adjustment inflation factors (HAIFs) for the core variables in corevarlist, caused by adjustment for the additional variables specified by addvars(). HAIFs are calculated for the variances and standard errors of the estimated linear regression parameters corresponding to the core variables. For each variance (or standard error), the HAIF is defined as the ratio of the variance (or standard error) of the parameter in a model containing both the core variables and the additional variables to the variance (or standard error) of the same parameter in a model containing only the core variables. Both quantities are calculated assuming that the second (core-only) model is true and that the outcome variable is homoskedastic (meaning that it has equal variances in all subpopulations defined by the predictor variables).
haifcomp calculates the ratios between the HAIFs for the same core variables caused by adjustment for two alternative lists of additional variables, namely a numerator list and a denominator list.
haif and haifcomp are intended for use in model selection, allowing the user to choose a model based on the joint distribution of the exposures and confounders, before estimating the parameters of the model from the data on the outcome variable.
Options for haif and haifcomp
noconstant specifies that the models being compared contain no constant term. If noconstant is not specified, then it is assumed that the models being compared contain a constant term, labelled _cons, and HAIFs (or HAIF ratios) are calculated for the variance and standard error of that constant term.
Options for haif only
addvars(varlist) specifies a list of additional variables, which must not contain any of the core variables. The HAIFs will then be the factors by which the variances and standard errors of the parameters of the core variables are multiplied when the additional variables specified by addvars() are included in the model, assuming that these additional variables do not really have any effect, and that the outcome variable is homoskedastic. Note that the variable list specified by addvars() may contain factor variables.
Options for haifcomp only
daddvars(varlist) specifies a list of additional variables, known as the denominator list, which must not contain any of the core variables. The HAIFs for the core variables, caused by adjustment for these additional variables, will then be defined as for the addvars() option of haif, and will be the denominators of the HAIF ratios. Note that the variable list specified by daddvars() may contain factor variables.
naddvars(varlist) specifies a second list of additional variables, known as the numerator list, which also must not contain any of the core variables, although it may contain variables in common with the list specified by daddvars(). The HAIFs for the core variables, caused by adjustment for this second list of additional variables, will then be defined as for the addvars() option of haif, and will be the numerators of the HAIF ratios. Note that the variable list specified by naddvars() may contain factor variables.
Remarks
Homoskedastic adjustment inflation factors (or HAIFs) measure the loss of power to measure the effects of core predictors on an outcome, caused by the inclusion in the model of unnecessary additional predictors. If these predictors are indeed unnecessary, and the true model is a homoskedastic (or equal-variance) linear regression model including only the core predictors, then it can be shown that the population variances and standard errors of the estimated core variable effects will be no smaller if the unnecessary variables are included than if they are not included. (See Subsections 3.7 and 5.4 of Seber, 1977.) The variance HAIFs (and standard error HAIFs) are the scale factors by which these variances (and standard errors) are scaled up by the inclusion of the unnecessary additional variables. The standard error HAIF is interpreted as the factor by which the confidence interval width for a core variable coefficient is scaled up by adjusting for the unnecessary additional variables. The variance HAIF is interpreted as the factor by which the experimenter would have to scale up the size of the experiment, in order to counteract the effect on the confidence interval width of adjusting for the unnecessary additional variables.
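As a small worked illustration of the two interpretations above, the following sketch takes a hypothetical variance HAIF of 1.44 (a made-up value, not output from haif) and derives the corresponding SE HAIF. Under the interpretation given above, the confidence interval would be 1.2 times as wide, and the experiment would need to be 1.44 times as large to compensate.

```stata
* Hypothetical variance HAIF, chosen only for illustration
scalar varhaif = 1.44
* The SE HAIF is the square root of the variance HAIF
scalar sehaif = sqrt(varhaif)
* Factor by which the confidence interval width is scaled up
display "CI width scale factor:    " sehaif
* Factor by which the experiment would have to be scaled up to compensate
display "Sample size scale factor: " varhaif
```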
Note that, if the additional variables are not unnecessary, then including them in the model will not necessarily increase the variance of the coefficients of the core variables. If the additional variables predict the outcome well, given each value of the core variables, then including the additional variables may even decrease the variance of the coefficients of the core variables. The HAIFs therefore represent a "worst case" scenario, based on the values of the core and additional predictor variables, assuming that we have no knowledge of the distribution of the outcome variable.
Methods and formulas
Let X denote the matrix whose columns are the core variables, let A denote the matrix whose columns are the additional variables specified by the addvars() option of haif, and let D denote the diagonal matrix of weights. The variance HAIF of the kth variable in X is then a ratio, whose numerator is the kth diagonal entry in the matrix
inverse( (X,A)' * D * (X,A) )
and whose denominator is the kth diagonal entry in the matrix
inverse( X' * D * X )
The standard error (SE) HAIF is the square root of the corresponding variance HAIF.
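The formula above can be sketched directly in Mata. This is only an illustrative calculation under stated assumptions, not the code used by haif: it assumes the unweighted case (so D is the identity matrix), uses the auto data with weight as the only core variable, treats the constant as the last column of X, and uses foreign and length as the additional variables. The names X, A, XA, dfull, dcore, varhaif, and sehaif are chosen here for illustration.

```stata
sysuse auto, clear
mata:
// Core design matrix X: weight plus a constant column (assumption: D = I)
X = st_data(., "weight"), J(st_nobs(), 1, 1)
// Matrix A of additional variables
A = st_data(., ("foreign", "length"))
XA = X, A
k = cols(X)
// Diagonal entries of inverse((X,A)'(X,A)) and inverse(X'X)
dfull = diagonal(invsym(quadcross(XA, XA)))
dcore = diagonal(invsym(quadcross(X, X)))
// Variance HAIFs for the k core columns; SE HAIFs are their square roots
varhaif = dfull[|1 \ k|] :/ dcore
sehaif = sqrt(varhaif)
// Display variance HAIFs (column 1) and SE HAIFs (column 2)
varhaif, sehaif
end
```

If the additional variables are truly unnecessary, each entry of varhaif should be no smaller than 1, in line with the result cited in the Remarks.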
haifcomp inputs two alternative lists of additional variables. Let B denote the matrix of additional variables specified by the daddvars() option, and let C denote the matrix of additional variables specified by the naddvars() option. The variance HAIF ratio for the kth variable in X is then a ratio, whose numerator is the kth diagonal entry in the matrix
inverse( (X,C)' * D * (X,C) )
and whose denominator is the kth diagonal entry in the matrix
inverse( (X,B)' * D * (X,B) )
and the SE HAIF ratio is the square root of the corresponding variance HAIF ratio.
The HAIF ratios produced by haifcomp are especially useful if the columns of B are linearly dependent on the columns of C, implying that the model with design matrix (X,B) is a sub-model of the model with design matrix (X,C). For example, X might have a single column which is an interesting exposure variable whose effect we wish to know, B might have a single column whose entries are all 1, and C might have multiple columns, containing indicators of the membership of a row (or observation) in each of a set of multiple mutually exclusive strata. The HAIF ratios will then measure the effect, on the variance and standard error of the slope for X, of fitting a multiple-intercept model (with a separate intercept for each stratum) instead of a single-intercept model (with one common intercept for all strata), assuming that the single-intercept model is true and that the outcome is homoskedastic.
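The connection between the HAIF ratio defined above and the two underlying HAIFs can be written out explicitly. Dividing numerator and denominator by the kth diagonal entry of inverse( X' * D * X ) shows that the variance HAIF ratio is the variance HAIF for the numerator list divided by the variance HAIF for the denominator list. The notation HAIF_k(.) is introduced here only for illustration and does not appear elsewhere in this help file.

```latex
\[
\frac{\bigl[\{(X,C)'D(X,C)\}^{-1}\bigr]_{kk}}
     {\bigl[\{(X,B)'D(X,B)\}^{-1}\bigr]_{kk}}
=
\frac{\bigl[\{(X,C)'D(X,C)\}^{-1}\bigr]_{kk} \big/ \bigl[(X'DX)^{-1}\bigr]_{kk}}
     {\bigl[\{(X,B)'D(X,B)\}^{-1}\bigr]_{kk} \big/ \bigl[(X'DX)^{-1}\bigr]_{kk}}
=
\frac{\mathrm{HAIF}_k(C)}{\mathrm{HAIF}_k(B)}
\]
```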
Note that the rows of the matrices correspond to observations with non-missing values for all variables in all varlists input to haif or haifcomp. Therefore, observations with missing values are deleted listwise.
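The listwise deletion described above can be checked with a sketch like the following, which assumes the auto data, where rep78 is missing for some cars. It compares the saved r(N) with a direct count of complete observations. The local macro name n_haif is chosen here for illustration; r(N) is copied into it first because count also saves its result in r(N).

```stata
sysuse auto, clear
* rep78 has missing values, so some observations should be dropped
haif weight, addvars(i.rep78)
* Copy r(N) before the next r-class command overwrites it
local n_haif = r(N)
* Count observations with non-missing values in all variables used
count if !missing(weight, rep78)
display "haif used `n_haif' observations; complete observations: " r(N)
```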
Examples
.sysuse auto, clear
.haif weight, add(foreign length)
.haif weight foreign, add(length headroom)
.haif weight, add(foreign)
The following example demonstrates the use of haifcomp in measuring the effect of fitting an unnecessary 2-intercept model in weight, with separate intercepts for US car models and non-US car models, when a single-intercept model in weight is true, and the outcome variable is homoskedastic. Note the use of factor variables in the naddvars() option.
.sysuse auto, clear
.gene byte baseline=1
.haifcomp weight, noc dadd(baseline) nadd(ibn.foreign)
Saved results
haif and haifcomp save the following in r():
Scalars
    r(N)       number of observations
Matrices
    r(haif)    variance and standard error HAIFs or HAIF ratios
The matrix r(haif) has 1 row for each variable in the list of core variables, and also an additional row for the constant term, if noconstant is not specified. It has 2 columns, the first containing variance HAIFs (or HAIF ratios), and the second containing standard error (SE) HAIFs (or HAIF ratios). This matrix is also listed to the Stata log, unless the user specifies the quietly prefix.
haif also saves the following in r():
Macros
    r(addvars)    varlist specified by addvars()
haifcomp also saves the following in r():
Macros
    r(daddvars)    varlist specified by daddvars()
    r(naddvars)    varlist specified by naddvars()
Note that, if the variable lists specified by addvars(), daddvars(), and naddvars() are factor variable lists, then the saved variable lists r(addvars), r(daddvars), and r(naddvars) will contain the corresponding expanded and specific factor variable lists.
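The saved results listed above might be retrieved along the following lines. This is a sketch that assumes the auto data and assumes that the rows of r(haif) follow the order of corevarlist, so that row 1 corresponds to weight. The matrix name H and the local macro names n and added are chosen here for illustration.

```stata
sysuse auto, clear
haif weight, addvars(foreign length)
* Copy the saved results before another r-class command overwrites them
matrix H = r(haif)
local n = r(N)
local added "`r(addvars)'"
display "Observations used:    `n'"
display "Additional variables: `added'"
* Column 1 holds variance HAIFs and column 2 holds SE HAIFs
display "Variance HAIF for weight: " H[1,1]
display "SE HAIF for weight:       " H[1,2]
```

Copying r(haif) into a named matrix in this way allows the HAIFs from several candidate lists of additional variables to be compared after further calls to haif.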
Author
Roger Newson, National Heart and Lung Institute, Imperial College London, UK. Email: r.newson@imperial.ac.uk
References
Seber, G. A. F. Linear Regression Analysis. New York, NY: John Wiley & Sons; 1977.
Also see
Manual: [R] regress; [U] 11.4.3 Factor variables; [P] fvexpand
Help: [R] regress, [U] fvvarlist, [P] fvexpand