{smcl}
{* *! version 1.0.0 28Oct2020}{...}
{viewerjumpto "Syntax" "crossvalidate##syntax"}{...}
{viewerjumpto "Description" "crossvalidate##description"}{...}
{viewerjumpto "Options" "crossvalidate##options"}{...}
{viewerjumpto "Examples" "crossvalidate##examples"}{...}
{viewerjumpto "Authors" "crossvalidate##authors"}{...}
{...}{* NB: these hide the newlines }
{...}
{...}
{title:Title}

{p2colset 5 23 25 2}{...}
{p2col :{cmd:crossvalidate} {hline 2}} k-fold cross-validation {p_end}
{p2colreset}{...}


{marker syntax}{...}
{title:Syntax}

{p 8 16 2}
{cmd:crossvalidate} {newvar} {cmd:estimation_command} {depvar} {indepvars} {ifin} {cmd:,} [ folds(#) gen(newvar) shuffle {it:options} ]

{pstd}
{cmd:crossv} can be used as a synonym for {cmd:crossvalidate}.

{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{synopt :{opth folds:(crossvalidate##folds:#)}} Number of cross-validation folds. {p_end}
{synopt :{opt gen(newvar)}} Optionally, save the variable that splits observations into folds.{p_end}
{synopt :{opt shuffle}} Optionally, put the data in random order.{p_end}
{synopt :{opt options}} Additional options are passed to the estimation command. {p_end}
{synoptline}


{marker description}{...}
{title:Description}

{pstd}
{cmd:crossvalidate} computes k-fold cross-validated predictions from any Stata estimation command.
The command splits the dataset into a number of subsets ("folds").
For each fold, it runs the estimator on all observations outside that fold and then predicts results for that fold.
{cmd:crossvalidate} stores the predicted values in a newly generated variable {cmd:newvar}.
Predicted values are generated by issuing the command {cmd:predict newvar} for each fold and,
depending on the estimation command, may represent probabilities, class predictions, or continuous values.

{pstd}
{cmd:crossvalidate} passes whatever options you give it directly to the estimator; it handles only the folding.
Examples of {cmd:estimation_command} include {cmd:svmachines} and {cmd:logistic}.

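
{pstd}
As a rough illustration of the procedure described above, the following sketch performs 5-fold cross-validation by hand for a linear regression.
It is not the source code of {cmd:crossvalidate}; the variable names {cmd:u}, {cmd:fold}, {cmd:yhat}, {cmd:tmp}, and {cmd:sqerr},
the seed value, and the choice of {cmd:regress} as the estimator are assumptions made for this example only.

{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. * put the observations in random order (reproducible via the seed)}{p_end}
{phang2}{cmd:. set seed 12345}{p_end}
{phang2}{cmd:. generate double u = runiform()}{p_end}
{phang2}{cmd:. sort u}{p_end}
{phang2}{cmd:. * assign fold numbers 1 to 5 in turn}{p_end}
{phang2}{cmd:. generate fold = mod(_n - 1, 5) + 1}{p_end}
{phang2}{cmd:. generate double yhat = .}{p_end}
{phang2}{cmd:. forvalues k = 1/5 {c -(}}{p_end}
{phang2}{cmd:. * fit on every fold except fold k}{p_end}
{phang2}{cmd:. quietly regress mpg weight length if fold != `k'}{p_end}
{phang2}{cmd:. * predict only for the held-out fold k}{p_end}
{phang2}{cmd:. quietly predict double tmp if fold == `k'}{p_end}
{phang2}{cmd:. quietly replace yhat = tmp if fold == `k'}{p_end}
{phang2}{cmd:. drop tmp}{p_end}
{phang2}{cmd:. {c )-}}{p_end}
{phang2}{cmd:. * out-of-sample root mean squared error}{p_end}
{phang2}{cmd:. generate double sqerr = (mpg - yhat)^2}{p_end}
{phang2}{cmd:. quietly summarize sqerr}{p_end}
{phang2}{cmd:. display "Cross-validated RMSE: " sqrt(r(mean))}{p_end}

{pstd}
In this sketch every observation receives exactly one prediction, made by a model that was fit without that observation.
The hand-built variable {cmd:fold} plays the role of the variable saved by the {opt gen(newvar)} option.
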
{title:Remarks}

{pstd}
Only estimation commands that allow the use of {cmd:predict} after estimation can be used.
The program does not currently support predicting multiple variables, as would be needed,
for example, for multinomial logistic regression.


{marker options}{...}
{title:Options}

{phang}
{marker folds}{...}
{opt folds:(#)} Number of folds. Common values are 5 and 10. By default, {cmd:folds(5)} is used.
{p_end}

{phang}
{marker shuffle}{...}
{opt shuffle} Optionally, assigns observations to folds at random.
This option uses random values; set the {cmd:seed} beforehand if reproducibility is required.
By default, folds are in sort order.
{p_end}

{phang}
{opt gen(newvar)} Optionally, save the variable that splits the observations into folds in the new variable {cmd:newvar}.
The folds are numbered 1, 2, ..., up to the number of folds.
This is useful for computing the average evaluation criterion for each fold later.
{p_end}

{phang}
{marker estimation_options}{...}
{opt options} Additional options are passed to the estimation command.
{p_end}


{marker examples}{...}
{title:Examples}

{pstd}
Typical classification with support vector machines:

{phang}{cmd:. sysuse auto}

{phang}{cmd:. crossvalidate P svmachines foreign headroom gear_ratio weight, folds(5) type(svc) gamma(0.4) c(51) }

{phang}{cmd:. gen err = foreign != P }

{phang}{cmd:. qui sum err }

{phang}{cmd:. di "Cross-validated error rate: `r(mean)'" }

{pstd}
Nearest-neighbor classification:

{phang}
{cmd:. crossvalidate P discrim knn headroom gear_ratio weight, k(3) group(foreign)}


{marker authors}{...}
{title:Authors}

{pmore}
Matthias Schonlau
{p_end}

{pmore}
Nick Guenther
{p_end}