{smcl}
{* *! version 21jan2023}{...}
{hline}
{cmd:help qddml}{right: v1.2}
{hline}
{title:Title}
{p2colset 5 19 21 2}{...}
{p2col:{hi: qddml} {hline 2}}Stata program for Double Debiased Machine Learning{p_end}
{p2colreset}{...}
{pstd}
{opt ddml} implements algorithms for causal inference aided by supervised
machine learning, as proposed by Chernozhukov et al. (2018),
{it:Double/debiased machine learning for treatment and structural parameters}
({it:The Econometrics Journal}). Five different models are supported, allowing for
binary or continuous treatment variables, endogeneity, high-dimensional
controls and/or instrumental variables.
{opt ddml} supports a variety of different ML programs, including
but not limited to {helpb lassopack} and {helpb pystacked}.
{pstd}
{opt qddml} is a wrapper program for {cmd:ddml}. It provides a convenient
one-line syntax with almost the full flexibility of {cmd:ddml}.
The main restriction of {cmd:qddml} is that it can only be used
with one machine learning program at a time, while {cmd:ddml}
allows for multiple learners per reduced-form equation.
{pstd}
{opt qddml} uses stacking regression ({helpb pystacked})
as the default machine learning program.
{pstd}
{opt qddml} relies on {helpb crossfit}, which can be used as a standalone
program.
{marker syntax}{...}
{title:Syntax}
{p 8 14 2}
{cmd:qddml}
{it:depvar} {it:regressors} [{cmd:(}{it:hd_controls}{cmd:)}]
[{cmd:(}{it:endog}{cmd:=}{it:instruments}{cmd:)}]
[{cmd:if} {it:exp}] [{cmd:in} {it:range}]
{opt model(name)}
{bind:[ {cmd:,}}
{opt cmd(string)}
{opt cmdopt(string)}
{opt mname(string)}
{opt noreg}
{opt ...} ]
{pstd}
Since {opt qddml} uses {helpb pystacked} by default,
it requires Stata 16 or higher, Python 3.x, and scikit-learn 0.24 or higher. See
{helpb python:this help file}, {browse "https://blog.stata.com/2020/08/18/stata-python-integration-part-1-setting-up-stata-to-use-python/":this Stata blog entry}
and
{browse "https://www.youtube.com/watch?v=4WxMAGNhcuE":this YouTube video}
for instructions on how to set up
Python on your system.
In short, install Python 3.x (we recommend Anaconda)
and set the appropriate Python path using {cmd:python set exec}.
If you don't have Stata 16+,
you can still use {cmd:qddml} with programs that don't rely on Python,
e.g., using the option {opt cmd(rlasso)}.
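{pstd}For example, assuming {cmd:rlasso} from {helpb lassopack} is installed, a Python-free call might look as follows (variable names are for illustration only):{p_end}
{phang2}. {cmd:qddml y d (x1-x10), model(partial) cmd(rlasso)}{p_end}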
{pstd}
Please check the {helpb qddml##examples:examples} provided at the end of the help file.
{marker options}{...}
{title:Options}
{synoptset 20}{...}
{synopthdr:General}
{synoptline}
{synopt:{opt model(name)}}
the model to be estimated; one of {it:partial}, {it:interactive},
{it:iv}, {it:fiv}, {it:interactiveiv}. See {helpb ddml##models:here} for an overview.
{p_end}
{synopt:{opt mname(string)}}
name of the DDML model. Allows running multiple DDML
models simultaneously. Defaults to {it:m0}.
{p_end}
{synopt:{opt kfolds(integer)}}
number of cross-fitting folds. The default is 5.
{p_end}
{synopt:{opt fcluster(varname)}}
cluster identifiers for cluster randomization of random folds.
{p_end}
{synopt:{opt foldvar(varname)}}
integer variable with user-specified cross-fitting folds.
{p_end}
{synopt:{opt reps(integer)}}
number of re-sampling iterations, i.e., how often the cross-fitting procedure is
repeated on randomly generated folds.
{p_end}
{synopt:{opt shortstack}} asks for short-stacking to be used.
Short-stacking runs constrained non-negative least squares on the
cross-fitted predicted values to obtain a weighted average
of several base learners.
{p_end}
{synopt:{cmdab:r:obust}}
report SEs that are robust to the
presence of arbitrary heteroskedasticity.
{p_end}
{synopt:{opt vce(type)}}
select the variance-covariance estimator; see {helpb regress##vcetype:here}.
{p_end}
{synopt:{opt cluster(varname)}}
select cluster-robust variance-covariance estimator.
{p_end}
{synopt:{opt noreg}}
do not add {helpb regress} as an additional learner.
{p_end}
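{pstd}To illustrate how these general options combine, a hypothetical call with 10 folds, 5 resampling iterations, short-stacking and cluster-robust inference might look as follows (variable names are for illustration only):{p_end}
{phang2}. {cmd:qddml y d (x1-x10), model(partial) kfolds(10) reps(5) shortstack cluster(id)}{p_end}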
{synoptset 20}{...}
{synopthdr:Learners}
{synoptline}
{synopt:{opt cmd(string)}}
ML program used for estimating conditional expectations.
Defaults to {helpb pystacked}.
See {helpb ddml##compatibility:here} for
other supported programs.
{p_end}
{synopt:{opt ycmd(string)}}
ML program used for estimating the conditional expectations of the outcome {it:Y}.
Defaults to {opt cmd(string)}.
{p_end}
{synopt:{opt dcmd(string)}}
ML program used for estimating the conditional expectations of the treatment variable(s) {it:D}.
Defaults to {opt cmd(string)}.
{p_end}
{synopt:{opt zcmd(string)}}
ML program used for estimating conditional expectations of instrumental variable(s) {it:Z}.
Defaults to {opt cmd(string)}.
{p_end}
{synopt:{opt *cmdopt(string)}}
options that are passed on to the ML program.
The asterisk {cmd:*} can be replaced with either nothing
(setting the default for all reduced form equations),
{cmd:y} (setting the default for the conditional expectation of {it:Y}),
{cmd:d} (setting the default for {it:D})
or {cmd:z} (setting the default for {it:Z}).
{p_end}
{synopt:{opt *vtype(string)}}
variable type of the variable to be created. Defaults to {it:double}.
{it:none} can be used to leave the type field blank
(this is required when using {cmd:ddml} with {helpb rforest}).
The asterisk {cmd:*} can be replaced with either nothing
(setting the default for all reduced form equations),
{cmd:y} (setting the default for the conditional expectation of {it:Y}),
{cmd:d} (setting the default for {it:D})
or {cmd:z} (setting the default for {it:Z}).
{p_end}
{synopt:{opt *predopt(string)}}
{cmd:predict} option to be used to get predicted values.
Typical values could be {opt xb} or {opt pr}. Default is
blank. The asterisk {cmd:*} can be replaced with either nothing
(setting the default for all reduced form equations),
{cmd:y} (setting the default for the conditional expectation of {it:Y}),
{cmd:d} (setting the default for {it:D})
or {cmd:z} (setting the default for {it:Z}).
{p_end}
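{pstd}To illustrate the equation-specific learner options, a hypothetical call using {helpb pystacked} with regression-type learners for {it:Y} and classification-type learners for a binary {it:D} might look as follows (variable names are for illustration only):{p_end}
{phang2}. {cmd:qddml y d (x1-x10), model(interactive) ycmdopt(type(reg) method(rf)) dcmdopt(type(class) method(rf))}{p_end}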
{synoptset 20}{...}
{synopthdr:Output}
{synoptline}
{synopt:{opt verb:ose}}
show detailed output.
{p_end}
{synopt:{opt vverb:ose}}
show even more detailed output.
{p_end}
{marker models}{...}
{title:Models}
{pstd}
See {helpb ddml##models:here}.
{marker compatibility}{...}
{title:Compatible programs}
{pstd}
See {helpb ddml##compatibility:here}.
{marker examples}{...}
{title:Examples}
{pstd}
Below we demonstrate the use of {cmd:qddml} for each of the five supported models.
Note that the estimation examples are chosen for demonstration purposes only and
are kept simple so that the code runs quickly.
Please also see the examples in the {helpb ddml##examples:ddml help file}.
{pstd}{ul:Partially linear model.}
{pstd}Preparations: we load the data, define global macros and set the seed.{p_end}
{phang2}. {stata "use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear"}{p_end}
{phang2}. {stata "global Y net_tfa"}{p_end}
{phang2}. {stata "global D e401"}{p_end}
{phang2}. {stata "global X tw age inc fsize educ db marr twoearn pira hown"}{p_end}
{phang2}. {stata "set seed 42"}{p_end}
{pstd}The option {cmd:model(partial)} selects the partially linear model,
and {cmd:kfolds(2)} selects two cross-fitting folds.
We use the options {cmd:cmd()} and {cmd:cmdopt()} to select
random forest for estimating the conditional expectations.{p_end}
{pstd}Note that we set the number of random folds to 2 so that
the model runs quickly. The default is {opt kfolds(5)}. We recommend
using at least 5-10 folds, and more if your sample size is small.{p_end}
{pstd}Note also that we recommend re-running the model multiple times on
different random folds; see the option {opt reps(integer)}.{p_end}
{phang2}. {stata "qddml $Y $D ($X), kfolds(2) model(partial) cmd(pystacked) cmdopt(type(reg) method(rf))"}{p_end}
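{pstd}Following the recommendations above, the same model could also be run with the default of 5 folds and, e.g., 10 resampling iterations (note that this takes considerably longer):{p_end}
{phang2}. {stata "qddml $Y $D ($X), kfolds(5) reps(10) model(partial) cmdopt(type(reg) method(rf))"}{p_end}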
{pstd}{ul:Partially linear IV model.}
{pstd}Preparations: we load the data, define global macros and set the seed.{p_end}
{phang2}. {stata "use https://statalasso.github.io/dta/AJR.dta, clear"}{p_end}
{phang2}. {stata "global Y logpgp95"}{p_end}
{phang2}. {stata "global D avexpr"}{p_end}
{phang2}. {stata "global Z logem4"}{p_end}
{phang2}. {stata "global X lat_abst edes1975 avelf temp* humid* steplow-oilres"}{p_end}
{phang2}. {stata "set seed 42"}{p_end}
{pstd}Since the
data set is very small, we consider 30 cross-fitting folds.{p_end}
{pstd}We need to add the option {opt vtype(none)} for {helpb rforest} to
work with {cmd:ddml}, since {helpb rforest}'s {cmd:predict} command doesn't
support variable types.{p_end}
{phang2}. {stata "qddml $Y ($X) ($D=$Z), kfolds(30) model(iv) cmd(rforest) cmdopt(type(reg)) vtype(none) robust"}{p_end}
{pstd}{ul:Interactive model--ATE and ATET estimation.}
{pstd}Preparations: we load the data, define global macros and set the seed.{p_end}
{phang2}. {stata "webuse cattaneo2, clear"}{p_end}
{phang2}. {stata "global Y bweight"}{p_end}
{phang2}. {stata "global D mbsmoke"}{p_end}
{phang2}. {stata "global X mage prenatal1 mmarried fbaby medu"}{p_end}
{phang2}. {stata "set seed 42"}{p_end}
{pstd}
Note that we use gradient boosted regression trees for E[Y|X,D] (see {opt ycmdopt()}),
but gradient boosted classification trees for E[D|X] (see {opt dcmdopt()}).
{p_end}
{phang2}. {stata "qddml $Y $D ($X), kfolds(5) reps(5) model(interactive) cmd(pystacked) ycmdopt(type(reg) method(gradboost)) dcmdopt(type(class) method(gradboost))"}{p_end}
{pstd}{cmd:qddml} reports the ATE by default. The option {cmd:atet}
returns the ATET estimate instead.{p_end}
{pstd}If we want to retrieve the ATET estimate after estimation,
we can simply use {cmd:ddml estimate}.{p_end}
{phang2}. {stata "ddml estimate, atet"}{p_end}
{pstd}{ul:Interactive IV model--LATE estimation.}
{pstd}Preparations: we load the data, define global macros and set the seed.{p_end}
{phang2}. {stata "use http://fmwww.bc.edu/repec/bocode/j/jtpa.dta, clear"}{p_end}
{phang2}. {stata "global Y earnings"}{p_end}
{phang2}. {stata "global D training"}{p_end}
{phang2}. {stata "global Z assignmt"}{p_end}
{phang2}. {stata "global X sex age married black hispanic"}{p_end}
{phang2}. {stata "set seed 42"}{p_end}
{phang2}. {stata "qddml $Y (c.($X)##c.($X)) ($D=$Z), kfolds(5) model(interactiveiv) cmd(pystacked) ycmdopt(type(reg) m(lassocv)) dcmdopt(type(class) m(lassocv)) zcmdopt(type(class) m(lassocv))"}{p_end}
{pstd}{ul:Flexible Partially Linear IV model.}
{pstd}Preparations: we load the data, define global macros and set the seed.{p_end}
{phang2}. {stata "use https://github.com/aahrens1/ddml/raw/master/data/BLP.dta, clear"}{p_end}
{phang2}. {stata "global Y share"}{p_end}
{phang2}. {stata "global D price"}{p_end}
{phang2}. {stata "global X hpwt air mpd space"}{p_end}
{phang2}. {stata "global Z sum*"}{p_end}
{phang2}. {stata "set seed 42"}{p_end}
{pstd}The syntax is the same as in the Partially Linear IV model,
but we now estimate the optimal instrument flexibly.{p_end}
{phang2}. {stata "qddml $Y ($X) ($D=$Z), model(fiv)"}{p_end}
{marker references}{...}
{title:References}
{pstd}
Chernozhukov, V., Chetverikov, D., Demirer, M.,
Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018),
Double/debiased machine learning for
treatment and structural parameters.
{it:The Econometrics Journal}, 21: C1-C68. {browse "https://doi.org/10.1111/ectj.12097"}
{marker installation}{...}
{title:Installation}
{pstd}
To get the latest stable version of {cmd:ddml} from our website,
check the installation instructions at {browse "https://statalasso.github.io/installation/"}.
We update the stable website version more frequently than the SSC version.
{pstd}
To verify that {cmd:ddml} is correctly installed,
click on or type {stata "whichpkg ddml"}
(which requires {helpb whichpkg}
to be installed; {stata "ssc install whichpkg"}).
{title:Authors}
{pstd}
Achim Ahrens, Public Policy Group, ETH Zurich, Switzerland {break}
achim.ahrens@gess.ethz.ch
{pstd}
Christian B. Hansen, University of Chicago, USA {break}
Christian.Hansen@chicagobooth.edu
{pstd}
Mark E Schaffer, Heriot-Watt University, UK {break}
m.e.schaffer@hw.ac.uk
{pstd}
Thomas Wiemann, University of Chicago, USA {break}
wiemann@uchicago.edu
{title:Also see (if installed)}
{pstd}
Help: {helpb lasso2}, {helpb cvlasso}, {helpb rlasso}, {helpb ivlasso},
{helpb pdslasso}, {helpb pystacked}.{p_end}