Template-Type: ReDIF-Article 1.0
Author-Name: Matthias Schonlau
Author-Workplace-Name: University of Waterloo
Author-Email: schonlau@uwaterloo.ca
Author-Name: Rosie Yuyan Zou
Author-Workplace-Name: University of Waterloo
Author-Email: y53zou@uwaterloo.ca
Title: The random forest algorithm for statistical learning
Journal: Stata Journal
Pages: 3-29
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909688
Abstract: Random forests (Breiman, 2001, Machine Learning 45: 5–32) is a statistical- or machine-learning algorithm for prediction. In this article, we introduce a corresponding new command, rforest. We overview the random forest algorithm and illustrate its use with two examples: The first example is a classification problem that predicts whether a credit card holder will default on his or her debt. The second example is a regression problem that predicts the log-scaled number of shares of online news articles. We conclude with a discussion that summarizes key points demonstrated in the examples.
Keywords: rforest, random decision forest algorithm
File-URL: http://hdl.handle.net/10.1177/1536867X20909688
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0587/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:3-29

Template-Type: ReDIF-Article 1.0
Author-Name: John Luke Gallup
Author-Workplace-Name: Portland State University
Author-Email: jlgallup@pdx.edu
Title: Added-variable plots for panel-data estimation
Journal: Stata Journal
Pages: 30-50
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909689
Abstract: In this article, I extend the theory of added-variable plots to three panel-data estimation methods: fixed effects, between effects, and random effects. An added-variable plot is an effective way to show the correlation between an independent variable and a dependent variable conditional on other independent variables.
In a multivariate context, a simple scatterplot showing x versus y is not adequate to show the relationship of x with y, because it ignores the impact of the other covariates. Added-variable plots are also useful for spotting influential outliers in the data that affect the estimated regression parameters. Stata can display added-variable plots with the command avplot, but it can be used only after regress. My new command, xtavplot, is a postestimation command that creates added-variable plots after xtreg estimates. Unlike avplot, xtavplot can display a confidence interval around the fitted regression line.
Keywords: xtavplot, xtavplots, added-variable plot, panel data, postestimation diagnostics, xtreg
File-URL: http://hdl.handle.net/10.1177/1536867X20909689
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/gr0082/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:30-50

Template-Type: ReDIF-Article 1.0
Author-Name: Fernando Rios-Avila
Author-Workplace-Name: Levy Economics Institute of Bard College
Author-Email: friosavi@levy.org
Author-Person: pri214
Title: Recentered influence functions (RIFs) in Stata: RIF regression and RIF decomposition
Journal: Stata Journal
Pages: 51-94
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909690
Abstract: Recentered influence functions (RIFs) are statistical tools popularized by Firpo, Fortin, and Lemieux (2009, Econometrica 77: 953–973) for analyzing unconditional partial effects on quantiles in a regression analysis framework (unconditional quantile regressions). The flexibility and simplicity of these tools have opened the possibility to extend the analysis to other distributional statistics using linear regressions or decomposition approaches.
In this article, I introduce one function and two commands to facilitate the use of RIFs in the analysis of outcome distributions: rifvar() is an egen extension used to create RIFs for a large set of distributional statistics, rifhdreg facilitates the estimation of RIF regressions enabling the use of high-dimensional fixed effects, and oaxaca_rif implements Oaxaca–Blinder decomposition analysis (RIF decompositions).
Keywords: rifvar(), rifhdreg, rifsureg2, oaxaca_rif, uqreg, recentered influence functions, unconditional partial effects, unconditional quantile regression, RIF regressions, distributional statistics, Oaxaca–Blinder, RIF decomposition
File-URL: http://hdl.handle.net/10.1177/1536867X20909690
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0588/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:51-94

Template-Type: ReDIF-Article 1.0
Author-Name: Sergio Correia
Author-Workplace-Name: Federal Reserve Board of Governors
Author-Email: sergio.a.correia@frb.gov
Author-Person: pco826
Author-Name: Paulo Guimarães
Author-Workplace-Name: Banco de Portugal
Author-Email: pfguimaraes@bportugal.pt
Author-Person: pgu11
Author-Name: Tom Zylkin
Author-Workplace-Name: University of Richmond
Author-Email: tzylkin@richmond.edu
Author-Person: pzy12
Title: Fast Poisson estimation with high-dimensional fixed effects
Journal: Stata Journal
Pages: 95-115
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909691
Abstract: In this article, we present ppmlhdfe, a new command for estimation of (pseudo-)Poisson regression models with multiple high-dimensional fixed effects (HDFE). Estimation is implemented using a modified version of the iteratively reweighted least-squares algorithm that allows for fast estimation in the presence of HDFE.
Because the code is built around the reghdfe package (Correia, 2014, Statistical Software Components S457874, Department of Economics, Boston College), it has similar syntax, supports many of the same functionalities, and benefits from reghdfe’s fast convergence properties for computing high-dimensional least-squares problems. Performance is further enhanced by some new techniques we introduce for accelerating HDFE iteratively reweighted least-squares estimation specifically. ppmlhdfe also implements a novel and more robust approach to check for the existence of (pseudo)maximum likelihood estimates.
Keywords: ppmlhdfe, reghdfe, Poisson regression, high-dimensional fixed effects
File-URL: http://hdl.handle.net/10.1177/1536867X20909691
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0589/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:95-115

Template-Type: ReDIF-Article 1.0
Author-Name: J. R. Lockwood
Author-Workplace-Name: Educational Testing Service
Author-Email: jrlockwood@ets.org
Author-Name: Daniel F. McCaffrey
Author-Workplace-Name: Educational Testing Service
Author-Email: dmccaffrey@ets.org
Title: Recommendations about estimating errors-in-variables regression in Stata
Journal: Stata Journal
Pages: 116-130
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909692
Abstract: Errors-in-variables (EIV) regression is a standard method for consistent estimation in linear models with error-prone covariates. The Stata commands eivreg and sem both can be used to compute the same EIV estimator of the regression coefficients. However, the commands do not use the same methods to estimate the standard errors of the estimated regression coefficients.
In this article, we use analysis and simulation to demonstrate that standard errors reported by eivreg are negatively biased under assumptions typically made in latent-variable modeling, leading to confidence interval coverage that is below the nominal level. Thus, sem alone or eivreg augmented with bootstrapped standard errors should be preferred to eivreg alone in most practical applications of EIV regression.
Keywords: errors-in-variables regression, eivreg, sem, standard-error estimation
File-URL: http://hdl.handle.net/10.1177/1536867X20909692
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0590/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:116-130

Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan Cook
Author-Workplace-Name: Public Company Accounting Oversight Board
Author-Email: jacook@uci.edu
Author-Name: Vikram Ramadas
Author-Workplace-Name: Public Company Accounting Oversight Board
Author-Email: vnramadas@ucdavis.edu
Title: When to consult precision-recall curves
Journal: Stata Journal
Pages: 131-148
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909693
Abstract: Receiver operating characteristic (ROC) curves are commonly used to evaluate predictions of binary outcomes. When there is a small percentage of items of interest (as would be the case with fraud detection, for example), ROC curves can provide an inflated view of performance. This can cause challenges in determining which set of predictions is better. In this article, we discuss the conditions under which precision-recall curves may be preferable to ROC curves. As an illustrative example, we compare two commonly used fraud predictors (Beneish’s [1999, Financial Analysts Journal 55: 24–36] M score and Dechow et al.’s [2011, Contemporary Accounting Research 28: 17–82] F score) using both ROC and precision-recall curves.
To aid the reader with using precision-recall curves, we also introduce the command prcurve to plot them.
Keywords: prcurve, precision-recall curves, classifier evaluation, ROC curves
File-URL: http://hdl.handle.net/10.1177/1536867X20909693
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0591/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:131-148

Template-Type: ReDIF-Article 1.0
Author-Name: Koen Jochmans
Author-Workplace-Name: University of Cambridge
Author-Email: kj345@cam.ac.uk
Author-Person: pjo240
Author-Name: Vincenzo Verardi
Author-Workplace-Name: Université de Namur
Author-Email: vverardi@unamur.be
Author-Person: pve73
Title: A portmanteau test for serial correlation in a linear panel model
Journal: Stata Journal
Pages: 149-161
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909695
Abstract: We introduce the command xtserialpm to perform the portmanteau test developed in Jochmans (2019, Cambridge Working Papers in Economics No. 1993, University of Cambridge, Faculty of Economics). The procedure tests for serial correlation of arbitrary form in the errors of a linear panel model after estimation of the regression coefficients by the within-group estimator. The test is designed for short panels and can deal with general missing-data patterns. The test is different from the related portmanteau test of Inoue and Solon (2006, Econometric Theory 22: 835–851), which is performed by xtistest (Wursten, 2018, Stata Journal 18: 76–100), in that it allows for heteroskedasticity. In simulations documented below, xtserialpm is found to provide a more powerful test than xthrtest (Wursten 2018), which performs the test for first-order autocorrelation of Born and Breitung (2016, Econometric Reviews 35: 1290–1316). We also provide comparisons with xtistest and xtserial (Drukker, 2003, Stata Journal 3: 168–177).
These tests perform well under stationarity but break down under even mild forms of heteroskedasticity.
Keywords: xtserialpm, heteroskedasticity, fixed-effects model, portmanteau test, serial correlation, short panel data, unbalanced panel
File-URL: http://hdl.handle.net/10.1177/1536867X20909695
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0592/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:149-161

Template-Type: ReDIF-Article 1.0
Author-Name: Ariel Linden
Author-Workplace-Name: Linden Consulting Group, LLC
Author-Email: alinden@lindenconsulting.org
Author-Person: pli1113
Author-Name: Maya B. Mathur
Author-Workplace-Name: Harvard University
Author-Email: mmathur@stanford.edu
Author-Name: Tyler J. VanderWeele
Author-Workplace-Name: tvanderw@hsph.harvard.edu
Author-Email: mfdicle@gmail.com
Title: Conducting sensitivity analysis for unmeasured confounding in observational studies using E-values: The evalue package
Journal: Stata Journal
Pages: 162-175
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909696
Abstract: In this article, we introduce the evalue package, which performs sensitivity analyses for unmeasured confounding in observational studies using the methodology proposed by VanderWeele and Ding (2017, Annals of Internal Medicine 167: 268–274). evalue reports E-values, defined as the minimum strength of association on the risk-ratio scale that an unmeasured confounder would need to have with both the treatment assignment and the outcome to fully explain away a specific treatment-outcome association, conditional on the measured covariates. evalue computes E-values for point estimates (and optionally, confidence limits) for several common outcome types, including risk and rate ratios, odds ratios with common or rare outcomes, hazard ratios with common or rare outcomes, standardized mean differences in outcomes, and risk differences.
Keywords: evalue, E-value, sensitivity analysis, treatment effects, causality, confounding
File-URL: http://hdl.handle.net/10.1177/1536867X20909696
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0593/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:162-175

Template-Type: ReDIF-Article 1.0
Author-Name: Achim Ahrens
Author-Workplace-Name: ETH Zürich
Author-Email: achim.ahrens@gess.ethz.ch
Author-Person: pah173
Author-Name: Christian B. Hansen
Author-Workplace-Name: University of Chicago
Author-Email: christian.hansen@chicagobooth.edu
Author-Person: pha982
Author-Name: Mark E. Schaffer
Author-Workplace-Name: Heriot-Watt University
Author-Email: m.e.schaffer@hw.ac.uk
Author-Person: psc51
Title: lassopack: Model selection and prediction with regularized regression in Stata
Journal: Stata Journal
Pages: 176-235
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909697
Abstract: In this article, we introduce lassopack, a suite of programs for regularized regression in Stata. lassopack implements lasso, square-root lasso, elastic net, ridge regression, adaptive lasso, and postestimation ordinary least squares. The methods are suitable for the high-dimensional setting, where the number of predictors p may be large and possibly greater than the number of observations, n. We offer three approaches for selecting the penalization (“tuning”) parameters: information criteria (implemented in lasso2), K-fold cross-validation and h-step-ahead rolling cross-validation for cross-section, panel, and time-series data (cvlasso), and theory-driven (“rigorous” or plugin) penalization for the lasso and square-root lasso for cross-section and panel data (rlasso). We discuss the theoretical framework and practical considerations for each approach. We also present Monte Carlo results to compare the performances of the penalization approaches.
Keywords: lasso2, cvlasso, rlasso, cvlassologit, lassologit, rlassologit, lasso2 postestimation, lassologit postestimation, rlasso postestimation, lasso, elastic net, square-root lasso, cross-validation
File-URL: http://hdl.handle.net/10.1177/1536867X20909697
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0594/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:176-235

Template-Type: ReDIF-Article 1.0
Author-Name: Nicholas J. Cox
Author-Workplace-Name: Durham University
Author-Email: n.j.cox@durham.ac.uk
Author-Person: pco34
Title: Speaking Stata: Concatenating values over observations
Journal: Stata Journal
Pages: 236-243
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909698
Abstract: Concatenation, or joining together, of strings or other values, possibly with extra punctuation such as spaces, is supported in Stata by addition of strings and by the egen function concat(), which concatenates values of variables within observations. In this column, I discuss basic techniques for concatenating values of variables over observations, emphasizing simple loops that can be tuned to suit variants as desired. Commonly, such concatenated strings report a profile or history of each individual within panel or longitudinal data. Such histories can then be analyzed further.
Keywords: concatenation, strings, panel data, longitudinal data
File-URL: http://hdl.handle.net/10.1177/1536867X20909698
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/pr0071/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:236-243

Template-Type: ReDIF-Article 1.0
Author-Name: Maarten L. Buis
Author-Workplace-Name: University of Konstanz
Author-Email: maarten.buis@uni-konstanz.de
Author-Person: pbu92
Title: Stata tip 135: Leaps and bounds
Journal: Stata Journal
Pages: 244-249
Issue: 1
Volume: 20
Year: 2020
Month: March
X-DOI: 10.1177/1536867X20909707
File-URL: http://hdl.handle.net/10.1177/1536867X20909707
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/pr0071/
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:244-249

Template-Type: ReDIF-Article 1.0
Author-Name: Editors
Author-Email: editors@stata.com
Title: Software updates
Journal: Stata Journal
Pages: 250-251
Issue: 1
Volume: 20
Year: 2020
Month: March
Abstract: Updates for previously published packages are provided.
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0399_1/
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0526_1/
Note: to access software from within Stata, net describe http://www.stata-journal.com/software/sj20-1/st0574_1/
Note: Windows users should not attempt to download these files with a web browser.
Handle:RePEc:tsj:stataj:v:20:y:2020:i:1:p:250-251