help for gformulaRhian Daniel -------------------------------------------------------------------------------

G-computation formula routine for estimating causal effects in the presence oftime-varying confounding or mediation

Syntax

gformulamainvarlist[if] [in] [,time-varying_optionsmediation_options]

optionsDescription -------------------------------------------------------------------------gformulatime-varying_options

outcome(varname)the outcome variablecommands(string)the commands to be used for model-fitting for each variable to be simulatedequations(string)the RHS of the equations for model-fitting for each variable to be simulatedidvar(varname)the subject idtvar(varname)the time variablevaryingcovariates(varlist)the time-varying covariatesintvars(varlist)the intervention variablesinterventions(string)the interventions to be compareddynamicfor comparing dynamic regimeseofuthe outcome is measured at the end of follow-up (as opposed to survival)pooledthe model-fitting is to be done by pooling across time-pointsmonotreatthe time-varying treatment is binary, and changes at most once (from zero to one)death(varname)the variable containing data on censoring due to deathderived(varlist)the derived variablesderrules(string)the rules for deriving the derived variablesfixedcovariates(varlist)the time-fixed covariateslaggedvars(varlist)the lagged variableslagrules(string)the rules for deriving the lagged variablesmsm(string)the marginal structural modelminsimsuppresses some unnecessary simulation, reducing MC errorimpute(varlist)the variables with missing values to be imputedimp_cmd(string)the commands to be used for imputing each of the incomplete variablesimp_eq(string)the RHS of the equations to be used for imputing each of the incomplete variablesimp_cycles(#)the number of cycles to be used when imputing each of the incomplete variables using chained equationssimulations(#)the number of MC simulationssamples(#)the number of bootstrap samplesseed(#)the random number seedallall bootstrap CIs are to be calculatedgrapha Kaplan-Meier curve is to be plottedsaving(string)the filename under which the MC simulated data are to be storedreplacereplace the file of the same name if it exists

gformulamediation_options

mediationthe analysis is a mediation analysis (as opposed to time-varying confounding)outcome(varname)the outcome variablecommands(string)the commands to be used for model-fitting for each variable to be simulatedequations(string)the RHS of the equations for model-fitting for each variable to be simulatedderived(varlist)the derived variablesderrules(string)the rules for deriving the derived variablesmsm(string)the marginal structural modelexposure(varlist)the exposure(s)mediator(varlist)the mediator(s)control(string)the value(s) of the mediator(s) at which the CDE is to be estimatedbaseline(string)the baseline value(s) of the exposure(s)obethere is only one binary exposureocethere is only one categorical exposurebase_confs(varlist)the confounders that are not affected by the exposure(s)post_confs(varlist)the confounders that are affected by the exposure(s)linexpspecifies that the exposure is continuous and its effect assumed to be linearminsimsuppresses some unnecessary simulation, reducing MC errormoreMCspecifies that the number of MC simulations will be greater than the sample sizeimpute(varlist)the variables with missing values to be imputedimp_cmd(string)the commands to be used for imputing each of the incomplete variablesimp_eq(string)the RHS of the equations to be used for imputing each of the incomplete variablesimp_cycles(#)the number of cycles to be used when imputing each of the incomplete variables using chained equationssimulations(#)the number of MC simulationssamples(#)the number of bootstrap samplesseed(#)the random number seedallall bootstrap CIs are to be calculatedsaving(string)the filename under which the MC simulated data are to be storedreplacereplace the file of the same name if it exists-------------------------------------------------------------------------

Description

gformulais an implementation of the g-computation procedure, used to estimate the causal effect of time-varying exposure(s) (A) on an outcome (Y) in the presence of time-varying confounders (L) that are themselves also affected by the exposure(s). The procedure can also be used to address the related problem of estimating controlled direct effects and natural direct/indirect effects when the causal effect of the exposure(s) on an outcome is mediated by intermediate variables, and in particular when confounders of the mediator-outcome relationships are themselves affected by the exposure(s).+--------------------------+ ----+ Time-varying confounding +-----------------------------------------

The g-computation procedure for time-varying confounding works by first modelling the relationships between the variables seen in the observational data. Using these models, it simulates what would have happened to the subjects in the study had the time-varying exposures been determined by intervention, rather than been allowed to evolve naturally as in the observational data. The modelling and simulation is carried out forward in time. That is, it starts by modelling the time 1 data given the time 0 data, which allows the time 1 data to be simulated under various hypothetical interventions (on the time 0 exposure) to be compared. Then it models the time 2 data given the time 0 and time 1 data in order to simulate the data at time 2 under the various interventions (on time 0 and time 1 exposures), and so on. All post-baseline confounders and outcome(s) are simulated under each intervention.

Causal inference can then be done by comparing the outcome(s) under different interventions as if these had been generated from a randomised experiment. This can either be done by comparing the simulated outcomes directly, or via a specified marginal structural model.

+-----------+ ----+ Mediation +--------------------------------------------------------

A substantively different, yet methodologically highly related, problem arises when we wish to decompose the causal effect of an exposure X on an outcome Y into an indirect effect, acting through a mediator M, and a direct effect not mediated by M. Standard methods cannot be used if there are confounders L of the M-Y relationship that are themselves affected by X.

To discuss precisely what we mean by direct and indirect effects, we use some counterfactual notation. Let Y(x,m) be the potential outcome if, possibly contrary to fact, X were set (by intervention) to x and M were set (by intervention) to m. The

controlled direct effect(CDE(m)) is a comparison of E{Y(x,m)} for different values of x, whilst keeping m fixed. For example, if X is univariate and binary, we might specifically consider the controlled direct effect (at m) to beCDE(m)=E[Y(1,m)]-E[Y(0,m)]

Now let M(x) be the potential value of the mediator if, possibly contrary to fact, X were set to x.

The

total causal effect(TCE) is a comparison of E(Y[x,M(x)]) for different values of x. Again, for binary X, we would haveTCE=E(Y[1,M(1)])-E(Y[0,M(0)])

It would be desirable to use these quantities to infer an indirect effect as the difference between the total effect and the direct effect. The fact that the controlled direct effect is a function of m makes this impossible.

For this reason, the

natural direct effect(NDE(x0)) is defined to be a comparison of E(Y[x,M(x0)]) for different values of x, keeping x0 fixed (usually at the baseline value of X, if such a natural choice exists). In other words, it is the effect of X on Y, were M to take on its natural value under the baseline intervention. For binary X, we would haveNDE(0)=E(Y[1,M(0)])-E(Y[0,M(0)])

Then the

natural indirect effect(NIE(x1)) can be defined as the difference between the total causal effect and the natural direct effect. Then it is a comparison of E(Y[x1,M(x)]) for different values of x, whilst keeping x1 fixed (at a natural choice of non-baseline value). This is best illustrated by thinking again of a binary X, when the natural indirect effect becomesNIE(1)=E(Y[1,M(1)])-E(Y[1,M(0)])

The g-computation procedure for the mediation example works in a similar way, except that for the natural direct and indirect effects, simulations under different hypothetical interventions need to be combined. Suppose that X is binary, then M is simulated under both X=1 and X=0, giving M(1) and M(0), respectively. To simulate Y[1,M(0)] (needed to estimate the natural direct effect), X is set to 1 at the same time as M is set to the simulated value under the intervention X=0.

If X is not binary and/or if X is multivariate, there may not be a natural comparison (such as 1 vs. 0) for calculating the total causal effect, controlled direct effect or natural direct/indirect effects. In this case, the formulae above are replaced by

CDE(m)=E[Y(X,m)]-E[Y(0,m)]

TCE=E(Y(X,M(X)))-E(Y(0,M(0)))

NDE(0)=E(Y[X,M(0)])-E(Y[0,M(0)])

and

NIE(X)=E(Y[X,M(X)])-E(Y[X,M(0)])

where 0 is still the baseline value(s) of X, but is now compared with the distribution of X arising naturally in the observational data.

+----------------+ ----+ Data structure +---------------------------------------------------

The outcome can be binary or continuous, or (in the time-varying confounding option) time-to-event.

For the time-varying confounding option (as opposed to the mediation option---see below), the data must be in long format (see reshape), i.e. there should be a separate record for each subject at each time-point. If the outcome is time-to-event, the outcome data for each subject should be given as a series of binary variables measured at each time-point. No records should be included in the dataset for subjects who have been censored before that time, due to death or loss to follow-up, or (in the case of a time-to-event outcome) due to having experienced the event before that time.

Any value which is to be imputed (including those at intermittent missing visits, for which a record must be included) should be denoted by a full stop (.), according to Stata's convention.

For the mediation option, there should be exactly one record per subject. Again, missing values to be imputed should be denoted by a full stop.

Options+---------------------------------------------+ ----+ gformula (time-varying confounding options) +----------------------

outcome(varname)specifies thatvarnameis the outcome variable.

commands(string)specifies which command (eitherregress,logit,mlogitorologit) should be used when fitting each of the parametric models. The variable name is followed by a colon (:), which is followed by the command name, with a comma (,) separating the different variables.Commands should be specified for the models for the outcome variable, time-varying confounders and the time-varying exposure. If there is censoring due to death, then the command used for the model for death should also be specified.

For a survival outcome (and death, if applicable), the command should be chosen as

logit, since the outcome (or death) is given (and simulated) as a sequence of binary variables.

equations(string)specifies the right-hand side of the equations used when fitting the models listed above. The name of the dependent variable is followed by a colon, which is followed by the list of independent variables. A comma (,) should separate the equations for the different dependent variables.Since the data are stored in long format, lagged variables will need to be used (see below) to incorporate the dependence on data from previous visits.

The equation for any particular variable, for example a time-varying confounder L, must be the same at each visit.

The prefix

i.should be used for any variable that is to be treated as categorical.

idvar(varname)specifies that varname is the numeric variable identifying the subject.

tvar(varname)specifies that varname is the numeric variable identifying the time-point.

varyingcovariates(varlist)specifies that varlist are the time-varying covariates. If lagged versions of these variables are to be used, only the unlagged versions should be included in this list.

intvars(varlist)specifies that varlist are the variables on which interventions are to be specified. If lagged versions of these variables are to be used, only the unlagged versions should be included in this list.

interventions(string)specifies the exact interventions to be compared. Different interventions should be separated by a comma (,) and different commands within one intervention should be separated by a backwards slash (\).

dynamicspecifies that the regimes to be compared are dynamic. If this option is not specified, it is assumed that the regimes to be compared (except for the observational regime) are all static.

eofuspecifies that the outcome is measured only at the end of follow-up. If this option is not specified, it is assumed that the outcome is time-to-event.

pooledspecifies that the models defined by thecommandandequationoptions above (and the imputation models - see below - if relevant) should be fitted to data from all visits at once, pooling across time-points. If this option is not specified, the models are fitted separately at each visit.

monotreatspecifies that the time-varying exposure in the observational data is binary, and changes at most once (from zero to one). Thus the exposure data for a given subject consists of a sequence of zeros followed by a sequence of ones (or a sequence containing only zeros, or only ones). This is common in many settings in which treatment may be initiated at some point, but never discontinued. Specifying this option only affects the way in which data are simulated under the observational regime. When using themonotreatoption, the corresponding command for simulating the time-varying exposure should be specified aslogit.

death(varname)gives the name of the variable (a sequence of binary variables at each time-point) which takes the value 0 if a subject is still alive at that time-point and 1 if a subject died between the previous and current time-points. (No further records following death will be included in the original dataset.)It is assumed that all censoring (before the final visit) where death=0 is due to loss to follow-up. Simulations are then drawn to mimic a situation in which there are deaths but no losses to follow-up.

If the

deathoption is not specified, all censoring (before the final visit) is assumed to be due to loss to follow-up and simulations are drawn to mimic no losses to follow-up.

derived(varlist)lists all the variables which are to be derived from other variables, such as interactions. Lagged variables themselves should not be included here, but variables derived using one or more lagged variables should be included. The derived variables must exist in the original dataset.

derrules(string)describes how the derived variables are to be obtained from the other variables. For example, if the variable al is to be created as the product of a and l, the code is derrules(al:a*l) (and al should be included inderived(varlist)above). The rules for generating more than one derived variable should be separated using a comma.

fixedcovariates(varlist)lists the time-fixed covariates. These do not depend on the time-varying exposure and thus are not simulated.

laggedvars(varlist)lists the lagged variables. The lagged variables must exist in the original dataset.

lagrules(string)gives further details of the lagged variables. For example, if the variable a_lag is the lagged version of a, and a_lag2 is the double-lagged version of a, this would be denoted as lagrules(a_lag:a 1, a_lag2:a 2).

msm(string)specifies the form of the marginal structural model, for example, msm(regress y a_lag a_lag2) or msm(stcox a_lag a_lag2).

minsimsuppresses the simulation of the outcome, and uses just the predicted means instead. This reduces unnecessary MC error in some contexts. Note: categorical outcomes fitted using ologit or mlogit are still simulated even with minsim.

impute(varlist)gives a list of the variables that contain missing values to be imputed via the method of single stochastic imputation using chained equations.

imp_cmd(string)specifies which command (eitherregress,logit,mlogitorologit) should be used when fitting each of the imputation models. The syntax is the same as for thecommandsoption described above.

imp_eq(string)specifies the RHS of each of the equations to be used for fitting each of the imputation models. The syntax is the same as for theequationsoption described above.

imp_cycles(#)specifies the number of cycles of chained equations to be used in the imputation procedure. The default is 10.

simulations(#)specifies the size of the Monte Carlo simulated dataset. The default is the same size as the observed dataset, but for computational reasons, it can be smaller.

samples(#)specifies the number of bootstrap samples. The default is 1,000.

seed(#)sets the random-number seed to#.

allspecifies that all bootstrap confidence intervals are to be displayed (normal, percentile, bias corrected, and bias corrected and accelerated). The default is to give normal-based bootstrap confidence intervals only. See bootstrap.

graphspecifies that a Kaplan-Meier plot of the survival curves under each intervention be displayed. This option is only relevant for a time-varying confounding analysis with a time-to-event outcome.

saving(string)saves the dataset containing the original observational data and all the Monte Carlo simulations in a Stata dataset namedstring. The dataset contains a variable _int which takes the value 0 for the observational data, the value 1 for the simulations corresponding to intervention 1, and so on for each of the n specified interventions. Finally, the Monte Carlo simulations under the observational regime appear at the end of the dataset, with _int taking the value n+1.

replacespecifies that if the .dta file given in the optionsaving(string)already exists, it should be overwritten.

+------------------------------+ ----+ gformula (mediation options) +-------------------------------------

mediationspecifies that the analysis is a mediation analysis. If this option is not specified, then a time-varying confounding analysis is assumed.

outcome(varname)specifies thatvarnameis the outcome variable.

commands(string)specifies which command (eitherregress,logit,mlogitorologit) should be used when fitting the simulation models. The variable name is followed by a colon (:), which is followed by the command name, with a comma (,) separating the different variables.Models must be specified for the mediator(s), outcome, and the post-baseline confounders of the mediator-outcome relationship that are affected by the exposure.

equations(string)specifies the right-hand side of the equations used when fitting the models listed above. The name of the dependent variable is followed by a colon (:), which is followed by the list of independent variables. A comma (,) should separate the equations for the different dependent variables.The prefix

i.should be used for any variable that is to be treated as categorical.

derived(varlist)lists all the variables which are to be derived from other variables, such as interactions.

derrules(string)describes how the derived variables are to be obtained from the other variables. For example, if the variable al is to be created as the product of a and l, the code is derrules(al:a*l) (and al should be included inderived(varlist)above). The rules for generating more than one derived variable should be separated using a comma.

msm(string)specifies the form of the marginal structural model, for example, msm(regress y x m) or msm(logit y x m xm).

exposure(varlist)specifies the exposure variable(s).

mediator(varlist)specifies the mediator variable(s).

control(string)specifies the value(s) at which the mediator(s) should be controlled for the controlled direct effect. If this option is not specified, only natural direct/indirect effects are estimated.

obespecifies that there is only one binary exposure, and that the comparisons should be made between X=1 and X=0. If neither this noroce(see next option) is specified, comparisons are made between the distribution of X in the observed data, and the baseline value(s).

ocespecifies that there is only one categorical exposure, and that the comparisons should be made between each non-baseline level of X and the baseline level, as specified using thebaselineoption above. If neither this norobeis specifid, comparisons are made between the distribution of X in the observed data, and the baseline value(s).

baseline(string)specifies the value(s) of the exposure(s) to be taken as baseline value(s).

base_confs(varlist)specifies the confounder(s) of the exposure-outcome relationship(s).

post_confs(varlist)specifies the confounder(s) of the mediator-outcome relationship(s).

linexpspecifies that the exposure is continuous and that its effect is assumed to be linear. Thus, the CDE is defined as E{Y(X+1,m)-Y(X,m)}, and so on.

minsimsuppresses the simulation of the outcome, and uses just the predicted means instead. This reduces unnecessary MC error in some contexts. Note: categorical outcomes fitted using ologit or mlogit are still simulated even with minsim. In addition, if the optionlinexphas been specified, even fewer outcomes need be simulated, and these unnecessary simulations are avoided when this option is specified.

moreMCspecifies that the number of MC simulations will be greater than the sample size. This can be useful in reducing MC error.

impute(varlist)gives a list of the variables that contain missing values to be imputed via the method of single stochastic imputation using chained equations.

imp_cmd(string)specifies which command (eitherregress,logit,mlogitorologit) should be used when fitting each of the imputation models. The syntax is the same as for thecommandsoption described above.

imp_eq(string)specifies the RHS of each of the equations to be used for fitting each of the imputation models. The syntax is the same as for theequationsoption described above.

imp_cycles(#)specifies the number of cycles of chained equations to be used in the imputation procedure. The default is 10.

simulations(#)specifies the size of the Monte Carlo simulated dataset. The default is the same size as the observed dataset.

samples(#)specifies the number of bootstrap samples. The default is 1,000.

seed(#)sets the random-number seed to#.

allspecifies that all bootstrap confidence intervals are to be displayed (normal, percentile, bias corrected, and bias corrected and accelerated). The default is to give normal-based bootstrap confidence intervals only. See bootstrap.

saving(string)saves the dataset containing the original observational data and all the Monte Carlo simulations in a Stata dataset namedstring.

replacespecifies that if the .dta file given in the optionsaving(string)already exists, it should be overwritten.

AuthorsRhian Daniel, Bianca De Stavola and Simon Cousens. Centre for Statistical Methodology, London School of Hygiene and Tropical Medicine. Rhian.Daniel@LSHTM.ac.uk

Further readingDaniel, R. M., De Stavola, B. L., and Cousens, S. N. 2011. gformula: Estimating causal effects in the presence of time-varying confounding or mediation using the g-computation formula.

The Stata Journal11(4):479-517.Robins, J. M. 1986. A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect.

Mathematical Modelling7:1393-1512.Robins, J. M., and Hernan, M. A. 2009.

Longitudinal Data Analysis, chap. 23: Estimation of the causal effects of time-varying exposures, 553-599. New York: Chapman and Hall / CRC Press.Pearl, J. 2001. Direct and Indirect Effects. In

Proceedings of the 17thConference on Uncertainty in Artificial Intelligence. 411-420.Didelez, V. 2006. Direct and Indirect Effects of Sequential Treatments. In

Proceedings of the 22nd Conference on Uncertainty in ArtificialIntelligence. 138-146.

Acknowledgements

gformulamakes use of thedetangleandformatlistprocedures fromice, by kind permission of Patrick Royston.This work was supported by the Medical Research Council, UK (Grant number: G0701024).

We are very grateful to Daniela Zugna, Deborah Ford and Linda Harrison for the suggestions they have made to improve this command.