{smcl}
{right:version 1.6.7 7.May.2022}
{title:}

{phang}
{cmd:cvauroc} {hline 2} Cross-validated Area Under the Curve for ROC Analysis after Predictive Modelling for Binary Outcomes

{title:Syntax}

{p 4 4 2}
{cmd: cvauroc} {depvar} {varlist} [if] [pw] [{cmd:,} Kfold() Seed() Probit Fit Detail graph graphlowess]
{p_end}

{title:Description}

{p 4 4 2}
Receiver operating characteristic (ROC) analysis is used for comparing predictive models, both in model selection and in model evaluation.
It is often applied in clinical medicine and social science to assess the tradeoff between model sensitivity and specificity.
After fitting a binary logistic regression model with a set of independent variables, the predictive performance of this set of variables,
as assessed by the area under the ROC curve (AUC), must be estimated in a sample (the 'test' sample) that is independent of the sample
used to fit the model (the 'training' sample).
An important aspect of predictive modeling (regardless of model type) is the ability of a model to generalize to new cases.
Evaluating the predictive performance (AUC) of a set of independent variables using all cases from the original analysis sample
tends to give an overly optimistic estimate of predictive performance.
When the number of observations is not very large, {hi:cross-validation} and {hi:bootstrap} strategies are useful for assessing
this ability to generalize; k-fold cross-validation in particular yields a more realistic estimate of predictive performance.
{hi:cvauroc} is a Stata rclass program that implements k-fold cross-validation for the AUC for a binary outcome after fitting a
logit or probit regression model.
{hi:cvauroc} averages the AUCs corresponding to each fold and applies the bootstrap procedure to the cross-validated AUC to obtain
statistical inference and 95% bias-corrected confidence intervals (CI).
Furthermore, {hi:cvauroc} optionally provides the cross-validated fitted probabilities for the dependent variable or outcome,
contained in a new variable named {hi:_fit}; the sensitivity and specificity, contained in two new variables named {hi:_sen}
and {hi:_spe}; and the plot of the mean cvAUC and k-fold ROC curves.
{p_end}

{title:Options}

{p 4 4 2}
{bf:pw} This option allows the user to include sampling weights (e.g., inverse-probability-of-censoring or inverse-probability-of-treatment weights, IPCW or IPTW).
{p_end}

{p 4 4 2}
{bf:Kfold} This option allows the user to set the number of random folds to an integer greater than or equal to 0 (default = 10).
{p_end}

{p 4 4 2}
{bf:Seed} This option allows the user to set the random seed to an integer greater than 1 (default = 7777).
{p_end}

{p 4 4 2}
{bf:Probit} This option allows the user to fit a probit model rather than the default logit model.
{p_end}

{p 4 4 2}
{bf:Fit} This option allows the user to generate a new variable (_fit) containing the cross-validated probabilities for the dependent variable or outcome.
{p_end}

{p 4 4 2}
{bf:Detail} This option allows the user to tabulate the prevalence of the outcome and the sensitivity, specificity, and false-positive values
at each level of the fitted outcome probabilities. Furthermore, it creates two new variables containing the cross-validated
sensitivity (_sen) and specificity (_spe) for the independent variable or predictor.
{p_end}

{p 4 4 2}
{bf:Graph} This option allows the user to graph the empirical cross-validated ROC curves for the respective k folds specified by the user.
{p_end}

{p 4 4 2}
{bf:Graphlowess} This option allows the user to graph a smoothed version of the mean cross-validated ROC curve and the empirical ROC curves for the respective k folds specified by the user.
{p_end}

{title:Example}

. use http://www.stata-press.com/data/r14/cattaneo2.dta
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)

. gen lbw = cond(bweight<2500,1,0.)

. cvauroc lbw mage medu mmarried prenatal fedu mbsmoke mrace order, kfold(10) seed(1972) probit fit det

1-fold (N=465).........AUC = 0.726
2-fold (N=464).........AUC = 0.752
3-fold (N=464).........AUC = 0.660
4-fold (N=464).........AUC = 0.621
5-fold (N=464).........AUC = 0.703
6-fold (N=465).........AUC = 0.742
7-fold (N=464).........AUC = 0.579
8-fold (N=464).........AUC = 0.641
9-fold (N=464).........AUC = 0.730
10-fold(N=464).........AUC = 0.704

Model:probit
Seed:1972

----------------------------------------------------------------
Cross-validated (cv) mean AUC, SD and Bootstrap Corrected 95%CI
----------------------------------------------------------------
cvMean AUC:                 | 0.6857
Booststrap corrected 95%CI: | 0.6348, 0.7079
cvSD AUC:                   | 0.0578
----------------------------------------------------------------

------------------------------------------------------------------
Mean cross-validated Sen, Spe and false(+) at lbw predicted values
------------------------------------------------------------------
Prevalence of lbw: 6.01%
------------------------
 _Pp |    _sen     _spe      _fp
-----+------------------------------
0.01 |   99.83     0.52    99.48
0.02 |   98.21     4.09    95.91
0.03 |   88.82    20.12    79.88
0.04 |   73.69    49.21    50.79
0.05 |   66.05    64.58    35.42
0.06 |   60.17    69.31    30.69
(output omitted ...)
0.39 |    0.27    99.98     0.02
------------------------------------------

*******************************************************
* Naive performance based on non-crossvalidated AUC   *
*******************************************************

. logistic lbw mage medu mmarried prenatal fedu mbsmoke mrace order

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(8)        =     137.10
                                                Prob > chi2       =     0.0000
Log likelihood = -986.35435                     Pseudo R2         =     0.0650

------------------------------------------------------------------------------
         lbw | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        mage |   .9959165   .0140441    -0.29   0.772     .9687674    1.023826
        medu |   .9451338   .0283732    -1.88   0.060     .8911276    1.002413
    mmarried |   .6109995   .1014788    -2.97   0.003     .4412328    .8460849
    prenatal |   .5886787    .073186    -4.26   0.000     .4613759    .7511069
        fedu |   1.040936   .0214226     1.95   0.051     .9997838    1.083782
     mbsmoke |   2.145619   .3055361     5.36   0.000     1.623086    2.836376
       mrace |   .3789501    .057913    -6.35   0.000     .2808648    .5112895
       order |    1.05529   .0605811     0.94   0.349     .9429895    1.180964
       _cons |   .3468141   .1498299    -2.45   0.014     .1487176    .8087812
------------------------------------------------------------------------------

. predict fitted, pr

. roctab lbw fitted

                      ROC                    -Asymptotic Normal--
           Obs       Area     Std. Err.      [95% Conf. Interval]
         ------------------------------------------------------------
         4,642     0.6939       0.0171        0.66041     0.72749

{title:Authors}

{p 4 4 2}
Miguel Angel Luque-Fernandez {break}
LSHTM, NCDE, Cancer Survival Group, London, UK {break}
Email: miguel-angel.luque@lshtm.ac.uk {break}

{p 4 4 2}
Camille Maringe {break}
LSHTM, NCDE, Cancer Survival Group, London, UK {break}
Email: camille.maringe at lshtm.ac.uk {break}

{p 4 4 2}
Daniel Redondo-Sanchez {break}
Biomedical Research Institute of Granada (ibs.Granada) {break}
Email: daniel.redondo.easp at juntadeandalucia.es {break}

{title:Acknowledgements}

{p 4 4 2}
Miguel Angel Luque-Fernandez is supported by the Spanish National Institute of Health, Carlos III Miguel Servet I Investigator Award (CP17/00206).
{p_end}

{title:References}

{p 4 4 2}
Luque-Fernandez MA, Redondo-Sánchez D, Maringe C. cvauroc: Command to compute cross-validated area under the curve for ROC analysis after predictive modeling for binary outcomes. The Stata Journal. 2019;19(3):615-625.
doi:10.1177/1536867X19874237
{p_end}

{p 4 4 2}
Luque-Fernandez MA. 2016. Crossvalidation in Epidemiology.
{browse "http://scholar.harvard.edu/malf/presentations/cross-validation-epidemiology": Presentation}
{p_end}

{p 4 4 2}
StataCorp. 2015. Stata Statistical Software: Release 14. College Station, TX: StataCorp LP.
{p_end}

{p 4 4 2}
Hastie T, Tibshirani R, Friedman J. 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
{p_end}

{title:Also see}

{psee}
Online: {helpb crossfold} {helpb roctab} {helpb lsens} {helpb lroc} {helpb rocreg}
{p_end}