Description
crossfold performs k-fold cross-validation on a specified model to evaluate the model's ability to fit out-of-sample data.
The procedure randomly splits the data into k partitions. For each partition, it fits the specified model on the other k-1 partitions and uses the resulting parameters to predict the dependent variable in the held-out partition.
Finally, crossfold reports a goodness-of-fit measure for each fold. The default evaluation metric is root mean squared error (RMSE).
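The procedure described above can be sketched in a few lines of Python. This is only an illustration of k-fold cross-validation under simplifying assumptions (a single predictor, ordinary least squares, unweighted errors); it is not crossfold's Stata source, and the names fit_line and k_fold_rmse are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Fall back to a constant prediction if x has no variation.
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return one out-of-sample RMSE per fold (hypothetical helper)."""
    n = len(xs)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of observations")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [order[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        # Fit on the other k-1 folds, then predict the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        errors.append(math.sqrt(sum(squared) / len(squared)))
    return errors


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.uniform(0, 10) for _ in range(200)]
    y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
    for fold, rmse in enumerate(k_fold_rmse(x, y, k=5), start=1):
        print(f"est{fold}: {rmse:.4f}")
```

Each fold is held out exactly once, so the sketch returns k error values, mirroring the one-row-per-fold results that crossfold reports.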
Syntax
crossfold model [model_if] [model_in] [model_weight], [eif()] [ein()] [eweight(varname)] [stub(string)] [k(value)] [loud] [mae] [r2] [model_options]
Options          Description
-------------------------------------------------------------------------
eif(); ein()     Error evaluation if and in specifications place
                 restrictions on the out-of-sample set that should be
                 fit. Modelling if and in restrictions should be
                 specified with the model.
eweight(varname) Weighting for error evaluation purposes. Model weights,
                 identical or not, should be specified after the model.
stub()           Specifies a stub name for naming estimation results and
                 for the results matrix. The default is est.
k()              Specifies the number of folds to carry out. The default
                 is 5, and k cannot exceed 300 or the number of
                 observations.
loud             Displays each model as it is fit.
mae              Calculates mean absolute errors (MAE) instead of RMSE.
r2               Calculates pseudo-R-squared (the square of the
                 correlation coefficient of the predicted and actual
                 values of the dependent variable) instead of RMSE.
model_options    Modelling command options (such as fe for xtreg).
-------------------------------------------------------------------------
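The three evaluation metrics named in the options (RMSE, MAE, and pseudo-R-squared as the squared correlation between predicted and actual values) can be written out as below. This is an illustrative Python sketch with hypothetical function names, not crossfold's own code. It is unweighted: this help text does not state how eweight() enters the calculation, so no weighting is shown.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def pseudo_r2(actual, predicted):
    """Square of the Pearson correlation between actual and predicted."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        # Correlation is undefined when either series is constant.
        return float("nan")
    return (cov * cov) / (var_a * var_p)


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    yhat = [2.0, 2.0, 4.0, 4.0]
    print(rmse(y, yhat), mae(y, yhat), pseudo_r2(y, yhat))
```

Lower values are better for RMSE and MAE, while a higher pseudo-R-squared indicates a closer linear relationship between predictions and outcomes.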
Examples
. sysuse nlsw88
(NLSW, 1988 extract)
. crossfold reg wage union
             |      RMSE
-------------+-----------
        est1 |  4.171849
        est2 |  4.105884
        est3 |  4.038483
        est4 |  4.151482
        est5 |  4.171727
. crossfold reg wage union, mae
             |       MAE
-------------+-----------
        est1 |   2.99209
        est2 |   3.13541
        est3 |  3.158161
        est4 |  3.035878
        est5 |  3.006016
. crossfold reg wage hours grade i.race i.industry i.occupation, r2
             | Pseudo-R2
-------------+-----------
        est1 |  .2036234
        est2 |  .1804039
        est3 |  .2213548
        est4 |  .2159976
        est5 |  .1556564
. crossfold qreg wage union [weight=hours], eweight(hours) mae
(importance weights assumed)
             |       MAE
-------------+-----------
        est1 |  3.078402
        est2 |  2.864632
        est3 |  2.846198
        est4 |  2.989049
        est5 |  2.990051
. crossfold qreg wage union collgrad age grade [weight=hours], eweight(hours) k(3) mae
(importance weights assumed)
             |       MAE
-------------+-----------
        est1 |  2.449628
        est2 |  2.700219
        est3 |  2.588182
Saved Results
crossfold saves the per-fold model errors in the matrix r(stub), which is named r(est) if no stub name is specified.
It also saves each fold's estimation results under the names stub1 ... stubk. They can be recalled using estimates restore name.
Author
Benjamin Daniels bbdaniels@gmail.com
References
Schonlau, Matthias. "Boosted regression (boosting): An introductory tutorial and a Stata plugin." The Stata Journal 5(3), 2005, pp. 330-354.
FAQ: What are pseudo R-squareds? UCLA: Academic Technology Services, Statistical Consulting Group. http://www.ats.ucla.edu/stat/mult_pkg/faq/general/psuedo_rsquareds.htm (accessed February 14, 2012).