{smcl}
{* *! version 1.0 14dec2014}{...}
{cmd:help overfit}{right:dialog: {dialog overfit}}
{hline}
{title:Title}
{pstd}
{bf:overfit} {hline 2} calculates shrinkage statistics to measure overfitting as well as out- and in-sample predictive bias
{title:Syntax}
{pstd}
{cmd:overfit} [{cmd:,} {it:options}] {cmd::} {it:est_command} {it:est_arguments} [{cmd:,} {it:est_options}]
{pstd}
where {it:est_command} is any estimation command that models the expectation of a quantitative outcome {it:y}, and for which the postestimation command {cmd:predict} can predict the expected value of {it:y}.
Refer to the help of the estimation command for the syntax of its arguments {it:(est_arguments)} and options {it:(est_options)}.
Note that {cmd:overfit} does not allow {cmd:if}, {cmd:in}, or {cmd:using} qualifiers, nor {cmd:weights}, in the estimation command.
{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Main}
{synopt:{opt predopt(string)}}specify options to the {cmd:predict} command{p_end}
{syntab:Cross-validation}
{synopt:{opt nbgrp(integer)}}specify the number of groups{p_end}
{synopt:{opt nbiter(integer)}}specify the number of repetitions{p_end}
{synopt:{opt seed(integer)}}initialize the random-number generator{p_end}
{synopt:{opt splitnorand}}perform a non-random group assignment{p_end}
{synopt:{opt efficient}}use a statistically more efficient but more computationally intensive method{p_end}
{syntab:Display}
{synopt:{opt noiterbar}}prevent {cmd:overfit} from showing the iteration progress{p_end}
{synopt:{opt noresults}}prevent {cmd:overfit} from displaying the results{p_end}
{synopt:{opt hist(string)}}display and save histograms of the shrinkage statistics{p_end}
{synopt:{opt showmod}}display the estimated model for each group at each iteration{p_end}
{synopt:{opt showslopes}}display the estimated slope statistics at each iteration{p_end}
{syntab:Save detailed results}
{synopt:{opt savemod(string)}}save the model's estimated coefficients for each group at each iteration{p_end}
{synopt:{opt savepred(string)}}save all predictions generated by the estimated model of each group at each iteration{p_end}
{synopt:{opt procnb(integer)}}set the processor number{p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}
{title:Description}
{pstd}
Command {cmd:overfit} calculates three shrinkage statistics to measure the amount of overfitting generated by an estimated model as well as its out- and in-sample predictive bias as defined in Bilger and Manning (2015).
Note that in the case of nonlinear models, all these shrinkage statistics are calculated on the (untransformed) scale of interest, which is also referred to as the "raw scale" of the outcome.
{pstd}
In addition to quantifying overfitting and predictive biases, the purpose of these three measures is to provide guidance to the analyst when choosing a model specification.
When overfitting is severe, reducing the model's flexibility by decreasing its nonlinearity and number of explanatory variables can potentially be beneficial.
On the other hand, when the model's in-sample predictive bias is severe, using a more flexible nonlinear form and adding explanatory variables can potentially be beneficial.
Both overfitting and in-sample bias adversely affect the out-of-sample predictive (or forecasting) performance of the model,
and it is important to measure the resulting out-of-sample predictive bias of the model when it is used to predict (or forecast) new outcomes.
These shrinkage statistics complement other model selection statistics such as Mallows's Cp and the Akaike information criterion by focusing on predictive calibration
(how good the predictions are on average as opposed to how well individual outcomes are predicted).
As such, shrinkage statistics are often of interest per se, for instance in economics for government budgeting or health insurance risk adjustment where getting the average prediction of a given population right is important.
{pstd}
Specifically, {cmd:overfit} calculates the following quantities:
{phang}
({opt 1 - delta}) measures the out-of-sample shrinkage or expansion (when the quantity is negative) that arises when the estimated model is used to predict new outcomes.
Such shrinkage or expansion is caused by both model misspecification and overfitting.
Note that {it:delta} is obtained by regressing the observed outcomes on their out-of-sample predictions.
For instance, if {it:(1 - delta)} equals 5%, the deviations above (below) the outcome's average are underestimated (overestimated) by out-of-sample predictions by 5%.
{phang}
({opt 1 - alpha}) measures the in-sample shrinkage or expansion (when the quantity is negative) that arises when the estimated model is used to predict outcomes from the estimation sample.
Such shrinkage or expansion is caused by model misspecification in the estimation sample.
Note that {it:alpha} is obtained by regressing the observed outcomes on their in-sample predictions.
For instance, if {it:(1 - alpha)} equals 5%, the deviations above (below) the outcome's average are underestimated (overestimated) by in-sample predictions by 5%.
{phang}
({opt 1 - gamma}) measures shrinkage that results from overfitting alone.
This quantity is notably immune to any shrinkage caused by in-sample misspecification,
and is obtained from the following relation:
{it:(1 - delta) = (1 - alpha) + alpha * (1 - gamma)}.
For instance, if {it:(1 - gamma)} equals 5%, the deviations above (below) the outcome's average are underestimated (overestimated) by out-of-sample predictions by 5% due to overfitting alone.
{phang}
All shrinkage statistics are calculated using repeated k-fold cross-validation, and their means (expressed as percentages) are reported along with their standard errors.
More information on the interpretation and measurement of the above quantities is provided in Bilger and Manning (2015).
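{pstd}
As a rough arithmetic check of the relation {it:(1 - delta) = (1 - alpha) + alpha * (1 - gamma)}, the overfitting component can be backed out from the other two quantities.
The figures used in this sketch (6.8% out-of-sample and 4.8% in-sample shrinkage, hence {it:alpha} = 0.952) are those quoted for the first example in the Examples section; the line is an illustration only and is not part of {cmd:overfit} itself.
{phang2}{cmd:. display 100 * (0.068 - 0.048) / 0.952}{p_end}
{pstd}
The result is about 2.1, in line with the 2.1% of shrinkage attributed to overfitting in that example.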
{title:Options}
{dlgtab:Main}
{phang}
{opt predopt(string)} specifies option(s) {it:string} to be passed to the postestimation command {cmd:predict}.
The option(s) must ensure that {cmd:predict} calculates the expected value of the outcome on the raw scale, e.g. {opt predopt("mu")}.
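{pstd}
A minimal usage sketch, assuming the example dataset distributed with {cmd:overfit} is in memory: after {cmd:glm}, {cmd:predict} accepts option {opt mu}, which returns the expected value of the outcome on the raw scale, so it can be passed explicitly as follows.
{phang2}{cmd:. overfit, predopt("mu"): glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{pstd}
For other estimation commands, the appropriate {cmd:predict} option differs and should be checked in the help of that command.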
{dlgtab:Cross-validation}
{phang}
{opt nbgrp(integer)} specifies the number of groups {it:k} for the {it:k}-fold cross-validation. Default is 10.
{phang}
{opt nbiter(integer)} specifies how many times the {it:k}-fold cross-validation is repeated. Default is 100.
{phang}
{opt seed(integer)} initializes the random number generator. Default is 1.
{phang}
{opt splitnorand} assigns the observations to the cross-validation groups on a non-random basis by keeping the ordering of the data in memory unchanged.
Observations 1..{opt _N}/{opt nbgrp} are assigned to group 1, observations {opt _N}/{opt nbgrp}+1..2*{opt _N}/{opt nbgrp} are assigned to group 2, and so on.
Note that iterating the process would always yield the same result, which is why {opt nbiter} is automatically set to 1 when {opt splitnorand} is specified.
{phang}
{opt efficient} requires {cmd:overfit} to make use of all in-sample predictions when estimating the in-sample slope.
This results in a more efficient estimation of {it:(1 - alpha)} and {it:(1 - gamma)}.
This option slows the estimation down and can become intractable, as it requires internally setting {opt nobs} to {cmd:_N} * {opt nbgrp}.
It is recommended to use option {opt efficient} only for small samples where efficiency is an important consideration.
The default is to use only one in-sample prediction per observation.
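{pstd}
The non-random assignment performed by {opt splitnorand} can be pictured with the sketch below, here for 5 groups.
It is an illustration only, not the code used by {cmd:overfit}: the variable name {cmd:grp} is hypothetical, the {cmd:auto} dataset merely serves as convenient data, and the handling of sample sizes that are not a multiple of the number of groups is an assumption.
{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. generate int grp = ceil(_n * 5 / _N)}{p_end}
{phang2}{cmd:. tabulate grp}{p_end}
{pstd}
Because the groups follow the order of the data in memory, sorting the data beforehand changes the assignment. A corresponding call, assuming the example dataset is in memory, could look like this:
{phang2}{cmd:. overfit, splitnorand nbgrp(5): glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}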
{dlgtab:Display}
{phang}
{opt noiterbar} prevents {cmd:overfit} from displaying dots to show the iteration progress.
{phang}
{opt noresults} prevents {cmd:overfit} from displaying the estimated shrinkage statistics.
{phang}
{opt hist(string)} displays histograms of the shrinkage statistics and saves them into file {it:string}.gph.
{phang}
{opt showmod} displays the estimated model for each group at each iteration.
Note that {opt nbgrp} * {opt nbiter} estimation results will be displayed if this option is specified.
{phang}
{opt showslopes} displays the estimated slope statistics {it:alpha} and {it:delta} for each iteration.
{dlgtab:Save detailed results}
{phang}
{opt savemod(string)} saves all estimated coefficients of the model into file {it:string}.dta.
In the saved file, column {cmd:iter} indicates the iteration number and column {cmd:estnb} the estimation number for the corresponding iteration, followed by the coefficient estimates.
Note that {opt nbgrp} * {opt nbiter} estimation results will be saved if this option is specified.
{phang}
{opt savepred(string)} saves all predictions along with the observation and group allocation ids, iteration and estimation numbers,
and dependent variable into file {it:string}.dta.
Note that {opt _N} * {opt nbgrp} * {opt nbiter} predicted values will be saved if this option is specified.
{phang}
{opt procnb(integer)} adds processor number {it:integer} to the names of the temporary datasets as follows: ___{cmd:procnb}_tempdata*.
By avoiding potential conflicts, this allows safe multiprocessing.
Default value is 1. Only positive integers are allowed.
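{pstd}
One possible way to use {opt procnb()} is sketched below: two Stata sessions sharing the same working directory each run part of the iterations.
Running the sessions side by side, the choice of seeds, and the file names {cmd:models1} and {cmd:models2} are assumptions made for illustration; how the saved results are combined afterwards is left to the user.
{pstd}In the first session:{p_end}
{phang2}{cmd:. overfit, procnb(1) seed(1) nbiter(50) savemod(models1): glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{pstd}In the second session:{p_end}
{phang2}{cmd:. overfit, procnb(2) seed(2) nbiter(50) savemod(models2): glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{pstd}
Distinct seeds are used so that the two sessions do not draw the same group assignments.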
{title:Stored results}
{pstd}
{cmd:overfit} saves the following in {cmd:r()}:
{synoptset 22 tabbed}{...}
{p2col 5 20 24 2: Scalars}{p_end}
{synopt:{cmd:r(missingvalues)}}equals 1 if at least one iteration is missing, and 0 otherwise{p_end}
{synopt:{cmd:r(nbcrashes)}}number of crashes encountered when estimating the model, {it:delta}, and {it:alpha}{p_end}
{synoptset 22 tabbed}{...}
{p2col 5 20 24 2: Matrices}{p_end}
{synopt:{cmd:r(shrinkage_iter)}}estimated shrinkage statistics for each iteration{p_end}
{synopt:{cmd:r(shrinkage_mean)}}average shrinkage statistics, with their standard errors in row 2.
If crashes occur, either during model estimation or during the estimation of {it:delta} and {it:alpha},
the average shrinkage statistics are computed using all crash-free iterations,
and row 3 displays the number of observations taken into account{p_end}
{synopt:{cmd:r(crashes)}}number of crashes that occurred, for each iteration, during the estimation of the model, {it:delta}, and {it:alpha}{p_end}
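{pstd}
A short sketch of how these results might be retrieved right after running {cmd:overfit}; the matrix name {cmd:S} is a hypothetical choice, and the example dataset is assumed to be in memory.
{phang2}{cmd:. overfit, nbiter(20) noresults: glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. return list}{p_end}
{phang2}{cmd:. matrix S = r(shrinkage_mean)}{p_end}
{phang2}{cmd:. matrix list S}{p_end}
{phang2}{cmd:. display r(nbcrashes)}{p_end}
{pstd}
Results in {cmd:r()} are replaced by the next r-class command, so copying them into a matrix or scalar first is advisable when they are needed later.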
{title:Examples}
{pstd}Install the command and download the simulated example dataset{p_end}
{phang2}{cmd:. net install overfit}{p_end}
{phang2}{cmd:. net get overfit}{p_end}
{pstd}Calculate the shrinkage statistics for a log-gamma GLM using the default option values{p_end}
{phang2}{cmd:. glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{phang2}In this example, deviations above (below) the mean are underestimated (overestimated) by 6.8% when the estimated model is used to predict (or forecast) new outcomes.{p_end}
{phang2}Out-of-sample shrinkage mainly comes from in-sample misspecification (4.8%), as overfitting causes less shrinkage (2.1%).{p_end}
{pstd}Example with more flexible age specification and additional covariates{p_end}
{phang2}{cmd:. glm healthexp gender age age2 age3 age4 age5 logincome shi phi diabetes hypertension obesity cholesterol smoker drinker inactive, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age age2 age3 age4 age5 logincome shi phi diabetes hypertension obesity cholesterol smoker drinker inactive, link(log) family(Gamma)}{p_end}
{phang2}The result is an increase in out-of-sample predictive bias mostly caused by an increase in overfitting.{p_end}
{pstd}Examples with a misspecified functional form for income and increasing flexibility{p_end}
{phang2}{cmd:. glm healthexp gender age income income2 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age income income2 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. glm healthexp gender age income income2 income3 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age income income2 income3 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. glm healthexp gender age income income2 income3 income4 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age income income2 income3 income4 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. glm healthexp gender age income income2 income3 income4 income5 shi phi, link(log) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age income income2 income3 income4 income5 shi phi, link(log) family(Gamma)}{p_end}
{phang2}One can see in-sample bias decreasing and overfitting increasing as the polynomial degree of income increases.{p_end}
{phang2}Among these polynomial specifications of income, the degree 3 polynomial is the best as it has the lowest out-of-sample bias.{p_end}
{pstd}Example with an inadequate link function{p_end}
{phang2}{cmd:. glm healthexp gender age logincome shi phi, link(power 0.5) family(Gamma)}{p_end}
{phang2}{cmd:. overfit: glm healthexp gender age logincome shi phi, link(power 0.5) family(Gamma)}{p_end}
{phang2}In this example, overfitting is not the main problem and a better in-sample specification is warranted.{p_end}
{pstd}More precise estimation of the shrinkage statistics{p_end}
{phang2}{cmd:. overfit, nbiter(250) efficient: glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{pstd}Example of 2-fold cross-validation{p_end}
{phang2}{cmd:. overfit, nbgrp(2): glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
{phang2}Note the larger standard errors.{p_end}
{pstd}Save histograms of the shrinkage statistics as well as the estimated models and predictions for each iteration and each group{p_end}
{phang2}{cmd:. overfit, nbiter(50) hist(myhist) savemod(mymodels) savepred(mypredictions): glm healthexp gender age logincome shi phi, link(log) family(Gamma)}{p_end}
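{pstd}Inspect the files saved by the previous command (a sketch only: apart from {cmd:iter} and {cmd:estnb}, which are described under option {opt savemod()}, the variable names in these files are not documented here, so {cmd:describe} is used to list them){p_end}
{phang2}{cmd:. use mymodels, clear}{p_end}
{phang2}{cmd:. describe}{p_end}
{phang2}{cmd:. tabulate iter}{p_end}
{phang2}{cmd:. use mypredictions, clear}{p_end}
{phang2}{cmd:. describe}{p_end}
{phang2}{cmd:. graph use myhist}{p_end}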
{title:Reference}
{pstd}
Bilger M. and W.G. Manning, 2015.
Measuring overfitting in nonlinear models: A new method and an application to health expenditures.
Health Economics 24(1), 75-85.
{pstd}
{it:To Will Manning (1946-2014), a truly wonderful human being and outstanding health economist.}
{title:Author}
{pstd}
{browse "http://www.duke-nus.edu.sg/content/bilger-marcel":Marcel Bilger}, Laboratory of Health Econometrics, Signature Program in Health Services and Systems Research, Duke-NUS Graduate Medical School, Singapore.
Email {browse "mailto:marcel.bilger@duke-nus.edu.sg":marcel.bilger@duke-nus.edu.sg} with "overfit.ado" as subject if you have any questions, comments, or suggestions regarding this user-written Stata command.