1.1 Fitting Generalized Estimating Equation (GEE) Regression Models in Stata
Nicholas Horton, Boston University School of Public Health, horton@bu.edu

Researchers are often interested in analyzing data that arise from a longitudinal or clustered design. While there are a variety of standard likelihood-based approaches to analysis when the outcome variables are approximately multivariate normal, models for discrete-type outcomes generally require a different approach. Liang and Zeger formalized an approach to this problem using Generalized Estimating Equations (GEEs) to extend Generalized Linear Models (GLMs) to a regression setting with correlated observations within subjects. In this talk, I will briefly review the GEE methodology, introduce some examples, and provide a tutorial on how to fit models using -xtgee- in Stata.

12 March 2001, 0900-1030, Estimation and fitting

1.2 The Quadratic Assignment Procedure (QAP)
William Simpson, Harvard Business School, wsimpson@hbs.edu

Some data sets contain observations corresponding to pairs of entities (people, companies, countries, etc.). Conceptually, each observation corresponds to a cell in a square matrix whose rows and columns are labelled by the entities. For example, consider a square matrix where the rows and columns are the 50 U.S. states. Each observation would contain numbers such as the distance between the pair of states, exports from one state to the other, etc. The observations are not independent, so estimation procedures designed for independent observations will calculate incorrect standard errors. The quadratic assignment procedure (QAP), which is commonly used in social network analysis, is a resampling-based method, similar to the bootstrap, for calculating correct standard errors. This talk explains the QAP algorithm and describes the -qap- command, with syntax similar to the -bstrap- command, which implements the quadratic assignment procedure and allows running any estimation command on QAP samples.

12 March 2001, 0900-1030, Estimation and fitting

1.3 The Normal Mixture Decomposition
Stanislav Kolenikov, University of North Carolina at Chapel Hill, skolenik@unc.edu

This talk will present a program for univariate normal mixture maximum likelihood estimation developed by the author. It will demonstrate the use of the -ml lf- estimation method, as well as a number of programming tricks, including global macro manipulation and dynamic definition of the program to be used by -ml-. The merits and limitations of Stata's -ml- optimizer will be discussed. An application to income distribution analysis with a real data set will also be shown.

12 March 2001, 0900-1030, Estimation and fitting
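For the GEE models discussed in abstract 1.1, a minimal sketch of an -xtgee- call in the Stata 7 syntax of the time; the binary outcome y, covariates x1 and x2, and panel identifier id are hypothetical:

    xtgee y x1 x2, i(id) family(binomial) link(logit) corr(exchangeable) robust

The robust option requests empirical ("sandwich") standard errors, which are commonly reported alongside GEE estimates.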
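For the normal mixture estimation in abstract 1.3, a minimal sketch of a two-component method-lf evaluator in current Stata syntax, not the author's program; the outcome y and the equation names are hypothetical, and a real application would need sensible starting values (e.g., via -ml init-):

    program define mixnorm2
        args lnf mu1 mu2 lns1 lns2 lpi
        tempvar p
        quietly generate double `p' = invlogit(`lpi')
        quietly replace `lnf' = ln(`p'*normalden($ML_y1,`mu1',exp(`lns1')) + (1-`p')*normalden($ML_y1,`mu2',exp(`lns2')))
    end

    ml model lf mixnorm2 (mu1: y =) (mu2:) /lns1 /lns2 /lpi
    ml maximize

Each observation's log likelihood is the log of a weighted sum of two normal densities; the mixing weight is kept on the logit scale and the standard deviations on the log scale so that the parameters are unconstrained.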
2.1 Post-estimation commands for regression models for categorical and count outcomes
Jeremy Freese, University of Wisconsin, jfreese@ssc.wisc.edu, and J. Scott Long, Indiana University, jslong@indiana.edu

Although Stata has made estimating regression models for categorical and count outcomes virtually as fast and easy as estimating the familiar regression model for continuous outcomes, interpreting the results from the former is complicated by the nonlinear relationship between the independent variables and the quantities of interest (i.e., predicted probabilities and predicted counts). As a consequence, the change in the predicted value associated with a unit change in an independent variable depends on the specific values of all of the independent variables.

We have developed a series of tools intended to facilitate the effective use and interpretation of these models. Our command -listcoef- lists different types of transformed coefficients from these models and provides a guide to their interpretation. A suite of commands, known collectively as -pr*-, computes predicted values and discrete changes for specified values of the independent variables. Our command -fitstat- computes a large number of goodness-of-fit statistics. Specifically for the multinomial logit model, the command -mlogtest- performs a number of commonly desired tests, and -mlogview- creates discrete-change and/or odds-ratio plots.

12 March 2001, 1100-1200, Model testing

2.2 Testing for omitted variables
Jeroen Weesie, Utrecht University, J.Weesie@fss.uu.nl

Testing for omitted variables should play an important part in specification analyses of statistical "linear form" models. Such omissions may comprise terms in variables that were themselves included (e.g., a quadratic term, or a categorical specification instead of a metric one), interactions between variables in the model, and variables that were left out to begin with. Re-estimating models with additional variables and performing (for example) likelihood-ratio tests is time-consuming. Score tests provide an attractive alternative, since they can be computed using only results from the model already estimated. We present a Stata command for performing score tests after most Stata estimation commands (e.g., logit, heckman, streg, etc.). This command supports multiple-equation models, clustered observations, and adjusted p-values for simultaneous testing.

12 March 2001, 1100-1200, Model testing

3.1 Computing Variances from Data with Complex Sampling Designs: A Comparison of Stata and SPSS
Alicia C. Dowd, University of Massachusetts Boston, Graduate College of Education, alicia.dowd@umb.edu, and Michael B. Duggan, Suffolk University, mduggan@admin.suffolk.edu

Most of the data sets available through the National Center for Education Statistics (NCES) are based on complex sampling designs involving multi-stage sampling, stratification, and clustering. These complex designs require appropriate statistical techniques to calculate the variance. Stata employs specialized methods that appropriately adjust for the complex designs, while SPSS does not. Researchers using SPSS must obtain the design effects from NCES and adjust the standard errors generated by SPSS with these values. This presentation addresses the pros and cons of recommending Stata or SPSS to novice researchers. The first presenter teaches research methods to doctoral students and uses Stata to conduct research with NCES data. She uses SPSS to teach her research methods course because of its user-friendly interface. The second presenter is a doctoral student conducting dissertation research with NCES data; in his professional life as an institutional researcher, he uses SPSS. NCES data sets are a rich resource, but the complex sampling designs create conceptual issues beyond the immediate grasp of most doctoral candidates in the field. The session considers, and invites comment on, the best approaches to introducing new researchers to complex sampling designs so that they can use NCES data.

12 March 2001, 1330-1500, Survey and multilevel data analysis
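For the post-estimation tools in abstract 2.1, a hedged sketch of typical use after a multinomial logit; the outcome y and covariates x1 and x2 are hypothetical, and the commands are the authors' add-ons, which must be installed separately:

    mlogit y x1 x2
    listcoef, help              // transformed coefficients plus a guide to interpretation
    fitstat                     // scalar measures of fit
    prvalue, x(x1=1) rest(mean) // predicted probabilities at specified covariate values
    prchange x1                 // discrete change in the predicted probabilities
    mlogtest, lr                // LR tests that each variable's coefficients are zero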
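For the design-based variance estimation described in abstract 3.1, a minimal sketch in current Stata syntax (in 2001 the same analysis used -svyset- followed by -svymean-); the design variables psu, strat, and pw and the outcome readscore are hypothetical:

    svyset psu [pweight=pw], strata(strat)
    svy: mean readscore
    estat effects

The -svy- prefix produces standard errors that account for clustering, stratification, and weighting, and -estat effects- reports the corresponding design effects (DEFF).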
3.2 svytabs: A program for producing complex survey tables
Michael Blasnik, Blasnik & Associates, mblasnik@110.net

Stata's -svytab- command is limited for report production, because the tables users need often require extracting a single point estimate (and standard error, confidence interval, or p-value) from each of dozens or hundreds of -svytab- commands. Svytabs was designed to produce these tables directly. It sets up and performs many -svytab- commands and grabs the appropriate output to create formatted tables ready to export to a word processor or spreadsheet. The added features include: 1) allows a full varlist for the row variables if they are dichotomous (sequencing through them and grabbing the estimate of interest from each); 2) allows either dichotomous or multi-valued row variables (if multi-valued, the varlist is restricted to one); 3) allows multiple subpopulations and cycles through them; 4) does not require -- but allows -- a column variable (allowing subpopulations to substitute); 5) formats the output into a log file for exporting as CSV (with table titling options); 6) uses characteristics to provide "nice" naming of rows and columns; 7) provides options for outputting standard errors, confidence intervals, significance-level asterisks, deff, etc. I think anyone producing complex survey tables would find svytabs quite useful.

12 March 2001, 1330-1500, Survey and multilevel data analysis

3.3 Simple Cases of Multi-Level Models
Rich Goldstein, richgold@ix.netcom.com

While much has been made of multi-level models, and of specialized software for such models, in many cases standard methods can be used to estimate these models. Use of such standard methods is, in many cases, faster and easier than use of specialized software; further, use of standard methods helps clarify what these models are actually estimating. I limit my discussion here to linear regression models and include a new ado-file that puts together the steps to match multi-level models in certain cases. If time allows, a comparison with the much slower gllamm6, for these limited situations, will be briefly presented.

12 March 2001, 1330-1500, Survey and multilevel data analysis

4.1 Date and Time Tags for Filenames in WinXX
Harriet E. Griesinger, Wellesley Child Care Research Partnership, hgriesin@Wellesley.edu

I receive several (ir)regular deliveries of data files for the ongoing development of a panel data set. Both the delivering agencies' systems and the targets of our research group change over time -- by the hour and/or by the year. I need to be able to identify from the filenames which Stata .dta files were created with which .do files, leaving which .log files. I use the Stata shell facility and DOS rename to attach an ado-generated global macro date tag and global macro hour-minute tag.

12 March 2001, 1530-1715, Longitudinal data analysis
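For the simple cases discussed in abstract 3.3, a hedged sketch (not Goldstein's ado-file) of a two-level random-intercept linear model fitted with a standard xt command, in the Stata 7 syntax of the time, alongside an ordinary regression with cluster-robust standard errors for comparison; the variables y, x, and school are hypothetical:

    xtreg y x, i(school) mle       // random-intercept (variance-components) model
    regress y x, cluster(school)   // same fixed part, cluster-robust standard errors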
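For the filename tagging in abstract 4.1, a hedged sketch of the idea (not Griesinger's ado-file), using Stata's built-in $S_DATE and $S_TIME macros and the Windows shell; the log-file name survey.log is hypothetical:

    global datetag = subinstr(trim("$S_DATE"), " ", "", .)         // e.g., 12Mar2001
    global timetag = subinstr(substr("$S_TIME", 1, 5), ":", "", .) // e.g., 0930
    ! rename survey.log survey_${datetag}_${timetag}.log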
4.2 Efficient Management of Multi-Frequency Panel Data with Stata
Christopher F Baum, Boston College, baum@bc.edu

This presentation discusses how the tasks involved in carrying out a sizable research project, involving panel data at both monthly and daily frequencies, can be efficiently managed by making use of built-in and user-contributed features of Stata. The project entails the construction of a dataset of cross-country monthly measures for 18 nations and the evaluation of bilateral economic activity between each distinct pair of countries.

One measure of volatility, at a monthly frequency, is calculated from daily spot exchange rate data and merged back to the monthly dataset. Nonlinear least squares models are estimated for every distinct bilateral relationship, and the results of those 300+ models are organized, using a postfile, for further analysis and for the production of summary tables and graphics. The various labor-saving techniques used to carry out this research will be discussed, with emphasis on the generality that allows additional countries, time periods, and data to be integrated into the panel dataset with ease.

12 March 2001, 1530-1715, Longitudinal data analysis

4.3 Challenges of Creating and Working with Cross-Year-Family-Individual Files: An Example from the PSID data set
Petia Petrova, Boston College, petrova@bc.edu

Often researchers need to build longitudinal data sets in order to study individuals and families, or firms and plants, across time. Whether individuals or firms are the points of interest, the resulting data matrix is no longer rectangular, owing to changes in family or firm composition. Often the data come in a different format, and simply merging on, for example, family and person IDs leads to wrong records. Here we use the Panel Study of Income Dynamics (PSID) to illustrate some of the pitfalls in creating a cross-year-family-individual file. To create such a file, one has to merge the family files with the individual files. As of 1990 the PSID file format consists of single-year files with family-level data collected in each wave (i.e., 26 family files for data collected from 1968 through 1993) and one cross-year individual file with the individual-level data collected from 1968 to the most recent interviewing wave. Attaching family records to the individual ones without taking into consideration splitoffs and movers in and out of the family, however, leads to cases in which members of the same family appear to have different information for family income. The core of the problem is that some of the information reported in the interview year refers to the previous year. If a person is a splitoff, he reports, for example, the family income of the family he is currently in. This income is then incorrectly attached to his record for the previous year, when he was in a different family. We suggest a way to fix problems like this one. The idea is to extract separately all variables referring to the year before the interview year, and then use the splitoff indicator to attach them to the individuals' records.

12 March 2001, 1530-1715, Longitudinal data analysis
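For the multi-frequency workflow in abstract 4.2, a hedged sketch of the general pattern rather than Baum's actual code, in current Stata syntax; the file names (daily.dta, monthly.dta, results.dta) and variables (ret, country, month) are hypothetical:

    use daily, clear
    collapse (sd) vol = ret, by(country month)   // monthly volatility from daily data
    merge 1:1 country month using monthly        // fold the measure back into the monthly panel

    postfile h str3 c1 str3 c2 double b double se using results, replace
    * ... loop over country pairs, estimate each nonlinear model, and for each pair
    * post h ("USA") ("JPN") (_b[x]) (_se[x]) ...
    postclose h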
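For the PSID example in abstract 4.3, a hedged sketch of the basic merge step in current Stata syntax; the file names (psid_ind.dta, fam1990.dta) and the 1990 family identifier famid90 are hypothetical:

    use psid_ind, clear                 // cross-year individual file
    merge m:1 famid90 using fam1990     // attach the 1990 family records

    * Pitfall: income reported in the 1990 interview refers to 1989. For a
    * splitoff, it should be attached to the family the person was in during
    * 1989, so those variables are extracted separately and matched back
    * using the splitoff indicator, as the abstract describes.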
4.4 Analysis of Longitudinal Data in Stata, S-Plus and SAS
Rino Bellocco, Karolinska Institutet, Rino.Bellocco@mep.ki.se

Longitudinal data are commonly collected in experimental and observational studies, where both disease and risk factors are measured at repeated times. The goal of this project is to compare analyses performed using Stata, S-Plus and SAS under two different families of distributions: normal and logistic. I will show the results obtained from the analyses of two sample data sets; these will be analysed using both Generalized Estimating Equation (GEE) and random-effects models. In Stata I will use both the xt commands and the routine provided by Rabe-Hesketh (gllamm6); confidence intervals, hypothesis testing and model fitting will be discussed. Missing data issues will be raised and discussed as well.

12 March 2001, 1530-1715, Longitudinal data analysis

5.1 Stata Teaching Tools
Phil Ender, UCLA Department of Education, ender@ucla.edu

This presentation will cover a collection of statistics teaching tools written in Stata. These programs involve demonstrations or simulations of various statistical topics and are used both in the classroom and individually by students. Topics include probability (coin, dice, box models), common probability distributions (normal, t, chi-square, F), sampling distributions, the central limit theorem, confidence intervals, correlation, regression, and other topics. These programs are currently being used in introductory and intermediate research methods courses taught in the UCLA Department of Education. The presentation will conclude with a short review of my experiences using Stata in the classroom over the past two years.

13 March 2001, 0900-1030, Assorted topics

5.2 Three-Valued Logic Operations in Stata
David Kantor, Institute for Policy Studies, Johns Hopkins University, dkantor@jhunix.hcf.jhu.edu

Stata uses numeric quantities as logical values and provides logical operators (&, |, ~) to build expressions from basic entities. These operators can be regarded as faulty when missing values are present in the operands: in this context, missing is equivalent to true, which is often not the desired result. Instead, one may want to obtain the maximal set of nonmissing results for all combinations of operand values, while preserving the behavior of the operators on two-valued operands -- in other words, one should adopt three-valued logic. I have developed a set of egen functions that provide this capability. They can do only one type of operation at a time, so complex expressions need to be built in stages, but they can be a great help when you wish to generate indicator variables and want the maximal set of nonmissing results.

13 March 2001, 0900-1030, Assorted topics

5.3 Analysing circular data in Stata
Nicholas J. Cox, University of Durham, n.j.cox@durham.ac.uk

Circular data are a large class of directional data, which are of interest to scientists in many fields, including biologists (movements of migrating animals), meteorologists (winds), geologists (directions of joints and faults) and geomorphologists (landforms, oriented stones). Such examples are all recordable as compass bearings relative to North. Other examples include phenomena that are periodic in time, including daily and seasonal rhythms. The analysis of circular data is an odd corner of statistical science which many never visit, even though it has a long and curious history. Perhaps for that reason, it seems that no major statistical language provides direct support for circular statistics, although there is a commercially available special-purpose program called Oriana. This paper describes the development and use of some routines written in Stata, primarily to allow graphical and exploratory analyses. They include commands for data management, summary statistics and significance tests, univariate graphics and bivariate relationships. The graphics routines were developed partly with -gph-. (By the time of the meeting, it may be possible to enhance these using new facilities in Stata 7.) Collectively they offer about as many facilities as does Oriana.

13 March 2001, 0900-1030, Assorted topics
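For the three-valued logic in abstract 5.2, a hedged illustration of the pitfall and of the behaviour the egen functions provide, written here with a plain -cond()- expression rather than the author's functions; x and z are hypothetical variables that may be missing:

    generate naive = (x > 5) | (z > 5)      // a missing x or z is treated as true
    * three-valued OR: 1 if either condition is definitely true, 0 only if both
    * are definitely false, and missing otherwise
    generate tvl = cond((x > 5 & x < .) | (z > 5 & z < .), 1, cond(x < . & z < ., 0, .))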
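For the circular statistics in abstract 5.3, a hedged sketch of one basic quantity, the mean direction, computed directly rather than with Cox's routines; bearing is a hypothetical variable holding directions in degrees measured from North:

    generate double s = sin(_pi*bearing/180)
    generate double c = cos(_pi*bearing/180)
    summarize s, meanonly
    scalar S = r(mean)
    summarize c, meanonly
    scalar C = r(mean)
    display "mean direction (degrees): " mod(180/_pi*atan2(S, C), 360)

The mean direction is the angle of the vector of average sine and cosine components, reduced to the interval [0, 360).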
6.1 Panel data analysis in Stata: An extended example
David Drukker, Stata Corporation, ddrukker@stata.com

13 March 2001, 1100-1200

7.1 The evolving nature of the Stata Technical Bulletin
H. Joseph Newton, Texas A&M University, stb@stata.com

13 March 2001, 1330-1530

7.2 Report to Users
William W. Gould, Stata Corporation, wwg@stata.com

13 March 2001, 1330-1530

8.1 Wishes and grumbles
Christopher F Baum (moderator), Boston College and RePEc, baum@bc.edu

13 March 2001, 1600-1700