{smcl}
{* *! version 30aug2024}{...}
{pstd}{ul:Cluster sampling with cross-fit folds - Basic example with {help pystacked}}{p_end}

{pstd}Load the data, define global macros and set the seed.{p_end}

{phang2}. {stata "webuse nlsw88, clear"}{p_end}
{phang2}. {stata "gen lwage = ln(wage)"}{p_end}
{phang2}. {stata "global Y lwage"}{p_end}
{phang2}. {stata "global D union"}{p_end}
{phang2}. {stata "global X age-c_city hours-tenure"}{p_end}
{phang2}. {stata "set seed 42"}{p_end}

{pstd}Initialize the model.
The {opt fcluster(industry)} ("fold-cluster") option tells {opt ddml}
to ensure that clusters (here, identified by the variable {opt industry})
are not split across cross-fit folds, i.e., each cluster appears in only one cross-fit fold.
Here we specify 2 cross-fit folds,
so all observations for each cluster will appear in either fold 1 or in fold 2.
NB: This example is somewhat artificial, because there are only 12 clusters (industries).{p_end}

{phang2}. {stata "ddml init partial, kfolds(2) fcluster(industry)"}{p_end}
{phang2}. {stata "tab industry m0_fid_1"}{p_end}
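
{pstd}As a compact check on the tabulation above, the minimum and maximum fold ID can be compared within each cluster.
This is only a sketch, and it assumes that the fold identifier is stored in {opt m0_fid_1}, as in the {cmd:tab} command above.
If no cluster is split across folds, the two columns should coincide for every industry.{p_end}

{phang2}. {stata "tabstat m0_fid_1, by(industry) statistics(min max)"}{p_end}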

{pstd}Since there are 12 clusters defined by {opt industry},
we could instead use 12 cross-fit folds, so that each cluster forms its own fold.
This split can be obtained either by specifying {opt kfolds(12)} together with {opt fcluster(industry)},
or by using {opt industry} itself as the fold identifier and specifying {opt foldvar(industry)}.
(NB: The split is the same but the fold numbering is different.){p_end}

{phang2}. {stata "ddml init partial, foldvar(industry)"}{p_end}
{phang2}. {stata "tab industry m0_fid_1"}{p_end}

{phang2}. {stata "ddml init partial, kfolds(12) fcluster(industry)"}{p_end}
{phang2}. {stata "tab industry m0_fid_1"}{p_end}

{pstd}Estimation is standard,
but cluster-robust SEs must be requested explicitly:
specify the {opt cluster()} option of {opt ddml estimate}
to use the cluster-robust covariance estimator:{p_end}

{phang2}. {stata "ddml E[Y|X]: pystacked $Y $X"}{p_end}
{phang2}. {stata "ddml E[D|X]: pystacked $D $X"}{p_end}
{phang2}. {stata "ddml crossfit"}{p_end}
{phang2}. {stata "ddml estimate, cluster(industry)"}{p_end}
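
{pstd}For comparison, the same model can be re-estimated with heteroskedasticity-robust SEs that ignore the clustering.
This is a sketch that assumes the {opt robust} option of {opt ddml estimate} is available in the installed version.
The point estimate should be unchanged; only the SEs should differ.{p_end}

{phang2}. {stata "ddml estimate, robust"}{p_end}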