xv != xi

Cross-Validation in Stata

Billy Buchanan, Sr Research Scientist, SAG Corporation
Steven Brownell, Sr Economist, SAG Corporation

Slides available at: https://411steven.github.io/stataConference2024

A Quick Poll

Motivation

  • At the 2023 conference, a user requested capabilities to make cross-validation easier
  • Limited cross-validation capabilities in the Stata ecosystem
  • ML/AI techniques rely heavily on cross-validation
  • The utility of inferential statistics is bounded by their generalizability

What is Cross-Validation?

  • A process used to evaluate out-of-sample performance of a model using existing data
  • Cross-validation can be used to tune a model's hyperparameters and/or to select among candidate models
  • The process has four basic steps: split, fit, predict, validate/evaluate
  • The process has two or three phases, depending on the splitting technique: training and testing (TT), or training, validation, and testing (TVT)
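The four steps can be sketched outside of Stata. The Python below is a minimal illustration on toy data, not the package's implementation; the function names, the 80/20 split, the one-predictor least-squares model, and the RMSE metric are all assumptions chosen for the example.

```python
import math
import random

def split(rows, train_prop, seed=0):
    """Step 1 (split): shuffle the rows, then cut them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_prop * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def fit(train):
    """Step 2 (fit): least squares for y = a + b*x using the training rows only."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict(model, rows):
    """Step 3 (predict): apply the fitted model to rows it has not seen."""
    intercept, slope = model
    return [intercept + slope * x for x, _ in rows]

def validate(preds, rows):
    """Step 4 (validate): root mean squared error on the held-out rows."""
    sq_err = sum((p - y) ** 2 for p, (_, y) in zip(preds, rows))
    return math.sqrt(sq_err / len(rows))

# Toy data (hypothetical): y = 2 + 3x with no noise
data = [(float(x), 2.0 + 3.0 * x) for x in range(20)]
train, test = split(data, 0.8)
model = fit(train)
rmse = validate(predict(model, test), test)
```

Because the model never sees the test rows during fitting, the RMSE here is an out-of-sample estimate.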

All Cross-Validation Techniques are K-Fold Techniques!!!

TT Split w/ K = 1 Fold

TT Split w/ K > 1 Folds

TVT Split w/ K = 1 Fold

TVT Split w/ K > 1 Folds
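One way to read the four cases above is that a single labeling routine covers all of them: the training rows are divided into K folds, and K = 1 reduces to an ordinary split. The Python sketch below illustrates that reading only; the function name, the label values, and the round-robin fold assignment are assumptions, not the package's coding scheme.

```python
import random

def assign_splits(n, props, k=1, seed=0):
    """Label n rows for a TT split (props has one value) or a TVT split (two values).

    Training rows receive fold ids 1..k; held-out rows are labeled
    "validation" or "test". All labels here are hypothetical.
    """
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    n_train = int(round(props[0] * n))
    n_valid = int(round(props[1] * n)) if len(props) == 2 else 0
    labels = [None] * n
    for pos, row in enumerate(order):
        if pos < n_train:
            labels[row] = pos % k + 1      # round-robin over the k training folds
        elif pos < n_train + n_valid:
            labels[row] = "validation"
        else:
            labels[row] = "test"
    return labels

tt_k1 = assign_splits(100, (0.8,), k=1)        # TT split, K = 1
tt_k5 = assign_splits(100, (0.8,), k=5)        # TT split, K > 1
tvt_k1 = assign_splits(100, (0.6, 0.2), k=1)   # TVT split, K = 1
tvt_k3 = assign_splits(100, (0.6, 0.2), k=3)   # TVT split, K > 1
```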

The crossvalidate package

Organization

  • xv prefix for the majority of use cases
  • xvloo prefix for leave-one-out cross-validation use cases
  • Step-based Commands
  • Utility Commands
  • Mata Library with utility functions and validation metrics

Prefix Commands

Prefix Command Syntax example

Other Commands

  • splitit, fitit, predictit, and validateit implement the individual steps of cross-validation
  • classify ensures classification models return classes and not predicted probabilities
  • cmdmod uses metaprogramming to rewrite the user's command so that it applies to the appropriate subset of the data
  • state adds data characteristics and returns values about the state of the computing environment
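The idea behind classify, turning predicted probabilities into predicted classes, can be sketched in Python. The function names, the 0.5 cutoff for binary outcomes, and the highest-probability rule for multiclass outcomes are assumptions for illustration and may not match the command's actual behavior.

```python
def binary_classes(probs, cutoff=0.5):
    """Map predicted probabilities to 0/1 classes using an assumed cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

def multiclass_classes(prob_rows):
    """Pick the index of the most probable class in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

binary = binary_classes([0.1, 0.5, 0.9])
multi = multiclass_classes([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
```

Classification metrics such as accuracy compare classes, so probabilities have to be converted before validation.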

splitit

  • Requires a single proportion for TT splits or two proportions for TVT splits
  • Can provide clustered splitting using the uid option
  • Implements K-Folds with the kfold option and LOO with the loo option
  • Stratified splitting and panel/time-series splitting are not currently implemented
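Clustered splitting keeps every row that shares an identifier in the same split. The Python sketch below illustrates the concept only: the function name is hypothetical, and applying the proportion to clusters rather than to rows is an assumption that may differ from how splitit treats the uid option.

```python
import random

def clustered_split(uids, train_prop, seed=0):
    """Assign whole clusters to train or test so that no cluster is divided."""
    rng = random.Random(seed)
    clusters = sorted(set(uids))
    rng.shuffle(clusters)
    cut = int(round(train_prop * len(clusters)))   # proportion of clusters, not rows
    train_ids = set(clusters[:cut])
    return ["train" if u in train_ids else "test" for u in uids]

# Toy data (hypothetical): 10 clusters with 3 rows each
uids = [i // 3 for i in range(30)]
labels = clustered_split(uids, 0.8)

# Collect the labels seen within each cluster
labels_by_cluster = {}
for u, lab in zip(uids, labels):
    labels_by_cluster.setdefault(u, set()).add(lab)
```

Splitting by row instead would let observations from one cluster appear in both training and test sets, which leaks information and overstates out-of-sample performance.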

Libxv

  • Includes parsing and string substitution functions
  • Implements a cross-tabulation function returning a struct for classification methods
  • Provides a function to verify hierarchies when using clustered splitting
  • Includes some statistical functions (e.g., dpois, kappawgts)
  • Provides more than 40 validation metrics for continuous, binary, and multiclass responses
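To show how classification metrics can be derived from a cross-tabulation of observed and predicted classes, here is a short Python sketch. It is not the Mata library's code; the function names and the toy vectors are assumptions.

```python
from collections import Counter

def crosstab(pred, obs):
    """Count each (observed, predicted) pair."""
    return Counter(zip(obs, pred))

def accuracy(table):
    """Share of rows where the predicted class equals the observed class."""
    total = sum(table.values())
    correct = sum(n for (o, p), n in table.items() if o == p)
    return correct / total

def sensitivity(table, positive=1):
    """True positives divided by all observed positives."""
    tp = table[(positive, positive)]
    fn = sum(n for (o, p), n in table.items() if o == positive and p != positive)
    return tp / (tp + fn)

# Toy data (hypothetical)
obs = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]
table = crosstab(pred, obs)
acc = accuracy(table)
sens = sensitivity(table)
```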

User Supplied Metrics

Demos

Cross-validation for Continuous Outcomes

Cross-validation for Logit

Cross-validation for Ordinal Outcomes

What's left?

Future work

  • Implementing stratification and xt/ts-appropriate methods in splitit
  • Implementing hyperparameter tuning with grid search, simulated annealing, and/or "Bayesian" optimization
  • Extending metrics to provide micro and macro aggregation, and adding metrics for other types of models
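For readers unfamiliar with the terms, micro aggregation pools counts across classes before computing a metric, while macro aggregation computes the metric per class and then averages. The Python sketch below uses precision as the example; the function names and toy data are assumptions and do not describe a planned API.

```python
def precision_counts(pred, obs):
    """Return {class: (true positives, false positives)} for each class."""
    counts = {}
    for c in sorted(set(obs) | set(pred)):
        tp = sum(1 for p, o in zip(pred, obs) if p == c and o == c)
        fp = sum(1 for p, o in zip(pred, obs) if p == c and o != c)
        counts[c] = (tp, fp)
    return counts

def macro_precision(counts):
    """Average the per-class precisions (assumes every class is predicted at least once)."""
    return sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)

def micro_precision(counts):
    """Pool true and false positives across classes, then compute a single precision."""
    tp = sum(t for t, _ in counts.values())
    fp = sum(f for _, f in counts.values())
    return tp / (tp + fp)

# Toy data (hypothetical): class "a" is more common than "b" or "c"
obs = ["a", "a", "a", "a", "b", "b", "c", "c"]
pred = ["a", "a", "a", "b", "b", "a", "c", "a"]
counts = precision_counts(pred, obs)
macro = macro_precision(counts)
micro = micro_precision(counts)
```

The two values differ whenever class sizes are unequal, because macro aggregation weights every class equally while micro aggregation weights every row equally.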

crossvalidate is easy to use, versatile, and extensible

Separation of concerns is not just for computer scientists!