{smcl} {* *! version 1.1 oct 30 2022}{...} {vieweralsosee "" "--"}{...} {vieweralsosee "Install command2" "ssc install cv_kfold"}{...} {vieweralsosee "Help command2 (if installed)" "help ck_kfold"}{...} {viewerjumpto "Syntax" "cv_kfold##syntax"}{...} {viewerjumpto "Description" "cv_kfold##description"}{...} {viewerjumpto "Options" "cv_kfold##options"}{...} {viewerjumpto "Remarks" "cv_kfold##remarks"}{...} {viewerjumpto "Examples" "cv_kfold##examples"}{...} {title:Title} {phang} {bf:cv_kfold} {hline 2} Module to implement k-fold cross-validation procedures {marker syntax}{...} {title:Syntax} {p 8 17 2} {cmdab:cv_kfold} [{cmd:,} {it:options}] {synoptset 20 tabbed}{...} {synopthdr} {synoptline} {syntab:Optional} {synopt:{opt k(#)}} Indicates the number of equal sizes subsamples will be used for the estimation of the k-fold cross validation. Default value is 5. {p_end} {synopt:{opt reps(#)}} Indicates times the cross validation procedure will be implemented. Default value is 1.{p_end} {synopt:{opt seed(str)}} The author can provide a seed number for the generation of the random groups. {p_end} {synoptline} {p2colreset}{...} {p 4 6 2} {marker description}{...} {title:Description} {pstd}This {cmd:cv_kfold} is a post estimation command that implements k-fold crossvalidation for various stata commands. {p_end} {pstd}The current version of this command can be used after: {cmd:regress}, {cmd:logit}, {cmd:probit}, {cmd:logit}, {cmd:cloglog}, {cmd:poisson}, {cmd:nbreg}, {cmd:mlogit}, {cmd:mprobit}, {cmd: ologit}, and {cmd: oprobit}. {p_end} {pstd}When used after {cmd:regress}, {cmd:cv_kfold} estimates and reports the average unweighted Root Mean Squared error (RMSE) across all repetitions. For all other estimation commands, it reports the unweighted model loglikelihood function (AvLL). {p_end} {pstd}Internally, {cmd:cv_kfold} uses the syntax from the previously estimated command for the k-fold cross validation producedure. Using the overall estimation sample, {cmd:k} random groups of equalsize are created, and the same previously model syntax is used to re-estimate the model. {p_end} {pstd}For example, if one uses a 5-folds, 4 of the 5 subsamples are used to estimate the model, leaving the 5th subsample to make an out-of sample prediction and evaluate the model using the RMSE or the AvLL. when the option reps() is used, the command repeats the k-fold procedure N times, and reports the average RMSE and AVLL across all repetitions, but stores the estatistics of each individual repetition in a separate matrix. {pstd} If you are interested in a leave-one-out cross validation procedure for {cmd:regress}, see {cmd:cv_regress} available from ssc. {pstd} The command has been tested under Stata 14. But it does not work with version control. {marker examples}{...} {title:Examples} {pstd} Set up {p_end} {pstd}{stata ssc install frause}{p_end} {pstd}{stata set seed 10101}{p_end} {pstd}{stata frause oaxaca, clear}{p_end} {pstd} Leave on out cross validation {p_end} {pstd}{stata ssc install cv_regress}{p_end} {pstd}{stata regress lnwage educ exper tenure female age agesq }{p_end} {pstd}{stata cv_regress}{p_end} {pstd} k-fold cross validation {p_end} {pstd}{stata regress lnwage educ exper tenure female age agesq }{p_end} {pstd}{stata cv_kfold}{p_end} {pstd} k-fold cross validation, with 5 repetitions {p_end} {pstd}{stata regress lnwage educ exper tenure female age agesq }{p_end} {pstd}{stata cv_kfold, reps(5) }{p_end} {pstd}{stata matrix list r(msqr) }{p_end} {pstd} k-fold for other type of models. Logit, poisson and mlogit {p_end} {pstd}{stata "drop if lnwage==." } {p_end} {pstd}{stata "gen dwage=lnwage>3.4" } {p_end} {pstd}{stata "gen wage=round(exp(lnwage))"}{p_end} {pstd}{stata "xtile qwage=lnwage, n(5) "}{p_end} {pstd}{stata "logit dwage educ exper tenure female age agesq" }{p_end} {pstd}{stata "cv_kfold, reps(5)" }{p_end} {pstd}{stata "matrix list r(msqr)" } {pstd} Currently, Poisson model only works if the Dep variable is {p_end} {pstd}{stata "poisson wage educ exper tenure female age agesq"}{p_end} {pstd}{stata "cv_kfold, reps(5) "}{p_end} {pstd}{stata "matrix list r(msqr) "} {p_end} {pstd}{stata "mlogit qwage educ exper tenure female age agesq"}{p_end} {pstd}{stata "cv_kfold, reps(5)" }{p_end} {pstd}{stata "matrix list r(msqr)" } {p_end} {pstd}{stata "ologit qwage educ exper tenure female age agesq"}{p_end} {pstd}{stata "cv_kfold, reps(5)" }{p_end} {pstd}{stata "matrix list r(msqr)" } {p_end} {title:Author} {pstd} Fernando Rios-Avila{break} Levy Economics Institute of Bard College{break} Annandale-on-Hudson, NY{break} friosavi@levy.org {title:Acknowledgement } Many thanks to Morteza Saharkhiz for suggesting extending he command to ologit and oprobit models. {title:Also see} {p 4 14 2} Help: {helpb cv_regress}