help for gformula                                                  Rhian Daniel
-------------------------------------------------------------------------------

G-computation formula routine for estimating causal effects in the presence of time-varying confounding or mediation

Syntax

gformula mainvarlist [if] [in] [, time-varying_options mediation_options]

options Description ------------------------------------------------------------------------- gformula time-varying_options

outcome(varname) the outcome variable commands(string) the commands to be used for model-fitting for each variable to be simulated equations(string) the RHS of the equations for model-fitting for each variable to be simulated idvar(varname) the subject id tvar(varname) the time variable varyingcovariates(varlist) the time-varying covariates intvars(varlist) the intervention variables interventions(string) the interventions to be compared dynamic for comparing dynamic regimes eofu the outcome is measured at the end of follow-up (as opposed to survival) pooled the model-fitting is to be done by pooling across time-points monotreat the time-varying treatment is binary, and changes at most once (from zero to one) death(varname) the variable containing data on censoring due to death derived(varlist) the derived variables derrules(string) the rules for deriving the derived variables fixedcovariates(varlist) the time-fixed covariates laggedvars(varlist) the lagged variables lagrules(string) the rules for deriving the lagged variables msm(string) the marginal structural model minsim suppresses some unnecessary simulation, reducing MC error impute(varlist) the variables with missing values to be imputed imp_cmd(string) the commands to be used for imputing each of the incomplete variables imp_eq(string) the RHS of the equations to be used for imputing each of the incomplete variables imp_cycles(#) the number of cycles to be used when imputing each of the incomplete variables using chained equations simulations(#) the number of MC simulations samples(#) the number of bootstrap samples seed(#) the random number seed all all bootstrap CIs are to be calculated graph a Kaplan-Meier curve is to be plotted saving(string) the filename under which the MC simulated data are to be stored replace replace the file of the same name if it exists

gformula mediation_options

mediation the analysis is a mediation analysis (as opposed to time-varying confounding) outcome(varname) the outcome variable commands(string) the commands to be used for model-fitting for each variable to be simulated equations(string) the RHS of the equations for model-fitting for each variable to be simulated derived(varlist) the derived variables derrules(string) the rules for deriving the derived variables msm(string) the marginal structural model exposure(varlist) the exposure(s) mediator(varlist) the mediator(s) control(string) the value(s) of the mediator(s) at which the CDE is to be estimated baseline(string) the baseline value(s) of the exposure(s) obe there is only one binary exposure oce there is only one categorical exposure base_confs(varlist) the confounders that are not affected by the exposure(s) post_confs(varlist) the confounders that are affected by the exposure(s) linexp specifies that the exposure is continuous and its effect assumed to be linear minsim suppresses some unnecessary simulation, reducing MC error moreMC specifies that the number of MC simulations will be greater than the sample size impute(varlist) the variables with missing values to be imputed imp_cmd(string) the commands to be used for imputing each of the incomplete variables imp_eq(string) the RHS of the equations to be used for imputing each of the incomplete variables imp_cycles(#) the number of cycles to be used when imputing each of the incomplete variables using chained equations simulations(#) the number of MC simulations samples(#) the number of bootstrap samples seed(#) the random number seed all all bootstrap CIs are to be calculated saving(string) the filename under which the MC simulated data are to be stored replace replace the file of the same name if it exists

-------------------------------------------------------------------------

Description

gformula is an implementation of the g-computation procedure, used to estimate the causal effect of time-varying exposure(s) (A) on an outcome (Y) in the presence of time-varying confounders (L) that are themselves also affected by the exposure(s). The procedure can also be used to address the related problem of estimating controlled direct effects and natural direct/indirect effects when the causal effect of the exposure(s) on an outcome is mediated by intermediate variables, and in particular when confounders of the mediator-outcome relationships are themselves affected by the exposure(s).

+--------------------------+ ----+ Time-varying confounding +-----------------------------------------

The g-computation procedure for time-varying confounding works by first modelling the relationships between the variables seen in the observational data. Using these models, it simulates what would have happened to the subjects in the study had the time-varying exposures been determined by intervention, rather than been allowed to evolve naturally as in the observational data. The modelling and simulation is carried out forward in time. That is, it starts by modelling the time 1 data given the time 0 data, which allows the time 1 data to be simulated under various hypothetical interventions (on the time 0 exposure) to be compared. Then it models the time 2 data given the time 0 and time 1 data in order to simulate the data at time 2 under the various interventions (on time 0 and time 1 exposures), and so on. All post-baseline confounders and outcome(s) are simulated under each intervention.

Causal inference can then be done by comparing the outcome(s) under different interventions as if these had been generated from a randomised experiment. This can either be done by comparing the simulated outcomes directly, or via a specified marginal structural model.

+-----------+ ----+ Mediation +--------------------------------------------------------

A substantively different, yet methodologically highly related, problem arises when we wish to decompose the causal effect of an exposure X on an outcome Y into an indirect effect, acting through a mediator M, and a direct effect not mediated by M. Standard methods cannot be used if there are confounders L of the M-Y relationship that are themselves affected by X.

To discuss precisely what we mean by direct and indirect effects, we use some counterfactual notation. Let Y(x,m) be the potential outcome if, possibly contrary to fact, X were set (by intervention) to x and M were set (by intervention) to m. The controlled direct effect (CDE(m)) is a comparison of E{Y(x,m)} for different values of x, whilst keeping m fixed. For example, if X is univariate and binary, we might specifically consider the controlled direct effect (at m) to be

CDE(m)=E[Y(1,m)]-E[Y(0,m)]

Now let M(x) be the potential value of the mediator if, possibly contrary to fact, X were set to x.

The total causal effect (TCE) is a comparison of E(Y[x,M(x)]) for different values of x. Again, for binary X, we would have

TCE=E(Y[1,M(1)])-E(Y[0,M(0)])

It would be desirable to use these quantities to infer an indirect effect as the difference between the total effect and the direct effect. The fact that the controlled direct effect is a function of m makes this impossible.

For this reason, the natural direct effect (NDE(x0)) is defined to be a comparison of E(Y[x,M(x0)]) for different values of x, keeping x0 fixed (usually at the baseline value of X, if such a natural choice exists). In other words, it is the effect of X on Y, were M to take on its natural value under the baseline intervention. For binary X, we would have

NDE(0)=E(Y[1,M(0)])-E(Y[0,M(0)])

Then the natural indirect effect (NIE(x1)) can be defined as the difference between the total causal effect and the natural direct effect. Then it is a comparison of E(Y[x1,M(x)]) for different values of x, whilst keeping x1 fixed (at a natural choice of non-baseline value). This is best illustrated by thinking again of a binary X, when the natural indirect effect becomes

NIE(1)=E(Y[1,M(1)])-E(Y[1,M(0)])

The g-computation procedure for the mediation example works in a similar way, except that for the natural direct and indirect effects, simulations under different hypothetical interventions need to be combined. Suppose that X is binary, then M is simulated under both X=1 and X=0, giving M(1) and M(0), respectively. To simulate Y[1,M(0)] (needed to estimate the natural direct effect), X is set to 1 at the same time as M is set to the simulated value under the intervention X=0.

If X is not binary and/or if X is multivariate, there may not be a natural comparison (such as 1 vs. 0) for calculating the total causal effect, controlled direct effect or natural direct/indirect effects. In this case, the formulae above are replaced by

CDE(m)=E[Y(X,m)]-E[Y(0,m)]

TCE=E(Y(X,M(X)))-E(Y(0,M(0)))

NDE(0)=E(Y[X,M(0)])-E(Y[0,M(0)])

and

NIE(X)=E(Y[X,M(X)])-E(Y[X,M(0)])

where 0 is still the baseline value(s) of X, but is now compared with the distribution of X arising naturally in the observational data.

+----------------+ ----+ Data structure +---------------------------------------------------

The outcome can be binary or continuous, or (in the time-varying confounding option) time-to-event.

For the time-varying confounding option (as opposed to the mediation option---see below), the data must be in long format (see reshape), i.e. there should be a separate record for each subject at each time-point. If the outcome is time-to-event, the outcome data for each subject should be given as a series of binary variables measured at each time-point. No records should be included in the dataset for subjects who have been censored before that time, due to death or loss to follow-up, or (in the case of a time-to-event outcome) due to having experienced the event before that time.

Any value which is to be imputed (including those at intermittent missing visits, for which a record must be included) should be denoted by a full stop (.), according to Stata's convention.

For the mediation option, there should be exactly one record per subject. Again, missing values to be imputed should be denoted by a full stop.

Options

+---------------------------------------------+ ----+ gformula (time-varying confounding options) +----------------------

outcome(varname) specifies that varname is the outcome variable.

commands(string) specifies which command (either regress, logit, mlogit or ologit) should be used when fitting each of the parametric models. The variable name is followed by a colon (:), which is followed by the command name, with a comma (,) separating the different variables.

Commands should be specified for the models for the outcome variable, time-varying confounders and the time-varying exposure. If there is censoring due to death, then the command used for the model for death should also be specified.

For a survival outcome (and death, if applicable), the command should be chosen as logit, since the outcome (or death) is given (and simulated) as a sequence of binary variables.

equations(string) specifies the right-hand side of the equations used when fitting the models listed above. The name of the dependent variable is followed by a colon, which is followed by the list of independent variables. A comma (,) should separate the equations for the different dependent variables.

Since the data are stored in long format, lagged variables will need to be used (see below) to incorporate the dependence on data from previous visits.

The equation for any particular variable, for example a time-varying confounder L, must be the same at each visit.

The prefix i. should be used for any variable that is to be treated as categorical.

idvar(varname) specifies that varname is the numeric variable identifying the subject.

tvar(varname) specifies that varname is the numeric variable identifying the time-point.

varyingcovariates(varlist) specifies that varlist are the time-varying covariates. If lagged versions of these variables are to be used, only the unlagged versions should be included in this list.

intvars(varlist) specifies that varlist are the variables on which interventions are to be specified. If lagged versions of these variables are to be used, only the unlagged versions should be included in this list.

interventions(string) specifies the exact interventions to be compared. Different interventions should be separated by a comma (,) and different commands within one intervention should be separated by a backwards slash (\).

dynamic specifies that the regimes to be compared are dynamic. If this option is not specified, it is assumed that the regimes to be compared (except for the observational regime) are all static.

eofu specifies that the outcome is measured only at the end of follow-up. If this option is not specified, it is assumed that the outcome is time-to-event.

pooled specifies that the models defined by the command and equation options above (and the imputation models - see below - if relevant) should be fitted to data from all visits at once, pooling across time-points. If this option is not specified, the models are fitted separately at each visit.

monotreat specifies that the time-varying exposure in the observational data is binary, and changes at most once (from zero to one). Thus the exposure data for a given subject consists of a sequence of zeros followed by a sequence of ones (or a sequence containing only zeros, or only ones). This is common in many settings in which treatment may be initiated at some point, but never discontinued. Specifying this option only affects the way in which data are simulated under the observational regime. When using the monotreat option, the corresponding command for simulating the time-varying exposure should be specified as logit.

death(varname) gives the name of the variable (a sequence of binary variables at each time-point) which takes the value 0 if a subject is still alive at that time-point and 1 if a subject died between the previous and current time-points. (No further records following death will be included in the original dataset.)

It is assumed that all censoring (before the final visit) where death=0 is due to loss to follow-up. Simulations are then drawn to mimic a situation in which there are deaths but no losses to follow-up.

If the death option is not specified, all censoring (before the final visit) is assumed to be due to loss to follow-up and simulations are drawn to mimic no losses to follow-up.

derived(varlist) lists all the variables which are to be derived from other variables, such as interactions. Lagged variables themselves should not be included here, but variables derived using one or more lagged variables should be included. The derived variables must exist in the original dataset.

derrules(string) describes how the derived variables are to be obtained from the other variables. For example, if the variable al is to be created as the product of a and l, the code is derrules(al:a*l) (and al should be included in derived(varlist) above). The rules for generating more than one derived variable should be separated using a comma.

fixedcovariates(varlist) lists the time-fixed covariates. These do not depend on the time-varying exposure and thus are not simulated.

laggedvars(varlist) lists the lagged variables. The lagged variables must exist in the original dataset.

lagrules(string) gives further details of the lagged variables. For example, if the variable a_lag is the lagged version of a, and a_lag2 is the double-lagged version of a, this would be denoted as lagrules(a_lag:a 1, a_lag2:a 2).

msm(string) specifies the form of the marginal structural model, for example, msm(regress y a_lag a_lag2) or msm(stcox a_lag a_lag2).

minsim suppresses the simulation of the outcome, and uses just the predicted means instead. This reduces unnecessary MC error in some contexts. Note: categorical outcomes fitted using ologit or mlogit are still simulated even with minsim.

impute(varlist) gives a list of the variables that contain missing values to be imputed via the method of single stochastic imputation using chained equations.

imp_cmd(string) specifies which command (either regress, logit, mlogit or ologit) should be used when fitting each of the imputation models. The syntax is the same as for the commands option described above.

imp_eq(string) specifies the RHS of each of the equations to be used for fitting each of the imputation models. The syntax is the same as for the equations option described above.

imp_cycles(#) specifies the number of cycles of chained equations to be used in the imputation procedure. The default is 10.

simulations(#) specifies the size of the Monte Carlo simulated dataset. The default is the same size as the observed dataset, but for computational reasons, it can be smaller.

samples(#) specifies the number of bootstrap samples. The default is 1,000.

seed(#) sets the random-number seed to #.

all specifies that all bootstrap confidence intervals are to be displayed (normal, percentile, bias corrected, and bias corrected and accelerated). The default is to give normal-based bootstrap confidence intervals only. See bootstrap.

graph specifies that a Kaplan-Meier plot of the survival curves under each intervention be displayed. This option is only relevant for a time-varying confounding analysis with a time-to-event outcome.

saving(string) saves the dataset containing the original observational data and all the Monte Carlo simulations in a Stata dataset named string. The dataset contains a variable _int which takes the value 0 for the observational data, the value 1 for the simulations corresponding to intervention 1, and so on for each of the n specified interventions. Finally, the Monte Carlo simulations under the observational regime appear at the end of the dataset, with _int taking the value n+1.

replace specifies that if the .dta file given in the option saving(string) already exists, it should be overwritten.

+------------------------------+ ----+ gformula (mediation options) +-------------------------------------

mediation specifies that the analysis is a mediation analysis. If this option is not specified, then a time-varying confounding analysis is assumed.

outcome(varname) specifies that varname is the outcome variable.

commands(string) specifies which command (either regress, logit, mlogit or ologit) should be used when fitting the simulation models. The variable name is followed by a colon (:), which is followed by the command name, with a comma (,) separating the different variables.

Models must be specified for the mediator(s), outcome, and the post-baseline confounders of the mediator-outcome relationship that are affected by the exposure.

equations(string) specifies the right-hand side of the equations used when fitting the models listed above. The name of the dependent variable is followed by a colon (:), which is followed by the list of independent variables. A comma (,) should separate the equations for the different dependent variables.

The prefix i. should be used for any variable that is to be treated as categorical.

derived(varlist) lists all the variables which are to be derived from other variables, such as interactions.

derrules(string) describes how the derived variables are to be obtained from the other variables. For example, if the variable al is to be created as the product of a and l, the code is derrules(al:a*l) (and al should be included in derived(varlist) above). The rules for generating more than one derived variable should be separated using a comma.

msm(string) specifies the form of the marginal structural model, for example, msm(regress y x m) or msm(logit y x m xm).

exposure(varlist) specifies the exposure variable(s).

mediator(varlist) specifies the mediator variable(s).

control(string) specifies the value(s) at which the mediator(s) should be controlled for the controlled direct effect. If this option is not specified, only natural direct/indirect effects are estimated.

obe specifies that there is only one binary exposure, and that the comparisons should be made between X=1 and X=0. If neither this nor oce (see next option) is specified, comparisons are made between the distribution of X in the observed data, and the baseline value(s).

oce specifies that there is only one categorical exposure, and that the comparisons should be made between each non-baseline level of X and the baseline level, as specified using the baseline option above. If neither this nor obe is specifid, comparisons are made between the distribution of X in the observed data, and the baseline value(s).

baseline(string) specifies the value(s) of the exposure(s) to be taken as baseline value(s).

base_confs(varlist) specifies the confounder(s) of the exposure-outcome relationship(s).

post_confs(varlist) specifies the confounder(s) of the mediator-outcome relationship(s).

linexp specifies that the exposure is continuous and that its effect is assumed to be linear. Thus, the CDE is defined as E{Y(X+1,m)-Y(X,m)}, and so on.

minsim suppresses the simulation of the outcome, and uses just the predicted means instead. This reduces unnecessary MC error in some contexts. Note: categorical outcomes fitted using ologit or mlogit are still simulated even with minsim. In addition, if the option linexp has been specified, even fewer outcomes need be simulated, and these unnecessary simulations are avoided when this option is specified.

moreMC specifies that the number of MC simulations will be greater than the sample size. This can be useful in reducing MC error.

impute(varlist) gives a list of the variables that contain missing values to be imputed via the method of single stochastic imputation using chained equations.

imp_cmd(string) specifies which command (either regress, logit, mlogit or ologit) should be used when fitting each of the imputation models. The syntax is the same as for the commands option described above.

imp_eq(string) specifies the RHS of each of the equations to be used for fitting each of the imputation models. The syntax is the same as for the equations option described above.

imp_cycles(#) specifies the number of cycles of chained equations to be used in the imputation procedure. The default is 10.

simulations(#) specifies the size of the Monte Carlo simulated dataset. The default is the same size as the observed dataset.

samples(#) specifies the number of bootstrap samples. The default is 1,000.

seed(#) sets the random-number seed to #.

all specifies that all bootstrap confidence intervals are to be displayed (normal, percentile, bias corrected, and bias corrected and accelerated). The default is to give normal-based bootstrap confidence intervals only. See bootstrap.

saving(string) saves the dataset containing the original observational data and all the Monte Carlo simulations in a Stata dataset named string.

replace specifies that if the .dta file given in the option saving(string) already exists, it should be overwritten.

Authors

Rhian Daniel, Bianca De Stavola and Simon Cousens. Centre for Statistical Methodology, London School of Hygiene and Tropical Medicine. Rhian.Daniel@LSHTM.ac.uk

Further reading

Daniel, R. M., De Stavola, B. L., and Cousens, S. N. 2011. gformula: Estimating causal effects in the presence of time-varying confounding or mediation using the g-computation formula. The Stata Journal 11(4):479-517.

Robins, J. M. 1986. A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect. Mathematical Modelling 7:1393-1512.

Robins, J. M., and Hernan, M. A. 2009. Longitudinal Data Analysis, chap. 23: Estimation of the causal effects of time-varying exposures, 553-599. New York: Chapman and Hall / CRC Press.

Pearl, J. 2001. Direct and Indirect Effects. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence. 411-420.

Didelez, V. 2006. Direct and Indirect Effects of Sequential Treatments. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence. 138-146.

Acknowledgements

gformula makes use of the detangle and formatlist procedures from ice, by kind permission of Patrick Royston.

This work was supported by the Medical Research Council, UK (Grant number: G0701024).

We are very grateful to Daniela Zugna, Deborah Ford and Linda Harrison for the suggestions they have made to improve this command.